Linking Data to Publications through Citation and Virtual Archives Micah Altman, Institute for Quantitative Social Science, Harvard University Prepared for the 2011 SSP 33rd Annual Meeting June 2011

May 06, 2015

Page 1: Linking Data to Publications through Citation and Virtual Archives

Linking Data to Publications through Citation and Virtual Archives

Micah Altman, Institute for Quantitative Social Science, Harvard University

Prepared for the 2011 SSP 33rd Annual Meeting, June 2011

Page 2: Linking Data to Publications through Citation and Virtual Archives

Collaborators*

Leonid Andreev, Ed Bachman, Adam Buchbinder, Ken Bollen, Bryan Beecher, Steve Burling, Kevin Condon, Jonathan Crabtree, Merce Crosas, Gary King, Patrick King, Tom Lipkis, Freeman Lo, Jared Lyle, Marc Maynard, Nancy McGovern, Lois Timms-Ferrarra, Akio Sone, Bob Treacy

Research Support

Thanks to the Library of Congress (PA#NDP03-1), the National Science Foundation (DMS-0835500, SES 0112072), IMLS (LG-05-09-0041-09), the Harvard University Library, the Institute for Quantitative Social Science, the Harvard-MIT Data Center, and the Murray Research Archive.

* And co-conspirators

Page 3: Linking Data to Publications through Citation and Virtual Archives

Related Work

Reprints available from: http://maltman.hmdc.harvard.edu

M. Altman, M. Adams, J. Crabtree, D. Donakowski, M. Maynard, A. Pienta, and C. Young. 2009. "Digital preservation through archival collaboration: The Data Preservation Alliance for the Social Sciences." The American Archivist 72(1): 169-182.

M. Altman and G. King. 2007. "A Proposed Standard for the Scholarly Citation of Quantitative Data." D-Lib Magazine 13(3/4).

M. Altman. 2008. "A Fingerprint Method for Verification of Scientific Data." In Advances in Systems, Computing Sciences and Software Engineering (Proceedings of the International Conference on Systems, Computing Sciences and Software Engineering 2007). Springer Verlag.

M. Crosas. 2011. "The Dataverse Network: An Open-Source Application for Sharing, Discovering and Preserving Data." D-Lib Magazine 17(1/2).

G. King. 2007. "An Introduction to the Dataverse Network as an Infrastructure for Data Sharing." Sociological Methods and Research 36(2): 173-199.

Page 4: Linking Data to Publications through Citation and Virtual Archives

Roadmap

• Motivations
• Elements of data management
• Citing Data
• Virtual Archives

Page 5: Linking Data to Publications through Citation and Virtual Archives

Data Access is Key to Science

Science is not (only) about being scientific
• Scientific progress requires community: competition and collaboration in the pursuit of common goals
• Without access to the same materials: no community exists

… data is the nucleus of scientific collaboration

The value of an article that can’t be replicated: ?
• Scholarly articles are summaries, not the actual research results
• Experimental data are expensive to reproduce; observational data are impossible to reproduce
• Hard for journal editors to verify
• If you find it, how do you know it’s the same?
• Replication projects show: many published articles cannot be replicated

… data is needed for scientific replication

Page 6: Linking Data to Publications through Citation and Virtual Archives

Data is Key to Democracy

Statistics = state-istics

The state tax authority: counting people, estimating wealth

Reformers use data to assess the performance of the state

Science informs public policy continually

In modern democracy: the public needs a direct source of information

Source: “Propaganda” http://www.media-studies.ca/articles/images/berlin_wall.jpg

Page 7: Linking Data to Publications through Citation and Virtual Archives

Open Data Enables New Forms of Science and Education

Data Intensive Science
• Increased opportunities for interdisciplinarity
• Science modeling reality across multiple scales
• Continuous, complete, fine-grained information on physical processes, systems, human behavior

Open Data Democratizes Science
• Citizen-scientist
• Developing countries
• Institutions outside of the inner circle of research

Education
• Open data eases transition from education to research

In addition, sharing data increases citation rates [Gleditsch 2003; Wilson 2008; Piwowar 2007]

Visualization from multiple experiments using the Community Climate System Model, through the Earth System Grid. Source: “Beyond Being There”, National Science Foundation, 2008.

Page 8: Linking Data to Publications through Citation and Virtual Archives

Science Model

“Unpublished data and personal communications: Citations to unpublished data and personal communications cannot be used to support claims in a published paper.”

“Data and materials availability: All data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science.”

Page 9: Linking Data to Publications through Citation and Virtual Archives

Some Formal Requirements

The Final NIH Statement on Sharing Research Data was published in the NIH Guide on February 26, 2003. “Starting with the October 1, 2003 receipt date, investigators submitting an NIH application seeking $500,000 or more in direct costs in any single year are expected to include a plan for data sharing or state why data sharing is not possible.”
• Timing: no later than the acceptance for publication of the main findings from the final data set

NSF: All proposals must (as of January 2011) include a data management plan. Specific requirements are vague, for the most part: “will be determined by the community of interest through the process of peer review and program management.”

Wellcome Trust: “will review data management and sharing plans, and any costs involved in delivering them, as an integral part of the funding decision”

Page 10: Linking Data to Publications through Citation and Virtual Archives

Data Management Plans

Safeguarding data for internal use in project
• Documentation
• Backup and recovery
• Review

Treatment of confidential and rights-encumbered information
• Consent to disclosure
• Overview: http://www.icpsr.org/DATAPASS/pdf/confidentiality.pdf
• Separation of identifying and sensitive information
• Obtain certificate of confidentiality, other legal safeguards
• De-identification and public use files
• Licensing

Dissemination
• Archiving commitment (include letter of support)
• Archiving timeline
• Access procedures
• Documentation
• User vetting, tracking, and support
• Licenses and restrictions
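
Illustration (not from the original slides): one way the "separation of identifying and sensitive information" item could look in practice. This is a minimal Python sketch under stated assumptions; the function name, field names, and records are all hypothetical, and splitting tables this way is only one step, not full de-identification (quasi-identifiers left in the research table can still re-identify people).

```python
import secrets

def split_identifiers(records, id_fields):
    """Split each record into an identifying part and a research part.

    The two parts share only a random link key, so the research table
    can be analyzed or shared without the direct identifiers.
    All names here are hypothetical illustrations.
    """
    identifying, research = [], []
    for record in records:
        key = secrets.token_hex(8)  # random key, carries no personal information
        identifying.append({"key": key, **{f: record[f] for f in id_fields}})
        research.append(
            {"key": key, **{f: v for f, v in record.items() if f not in id_fields}}
        )
    return identifying, research

# Made-up example records
records = [
    {"name": "A. Respondent", "zip": "00000", "income": 52000, "vote": "yes"},
    {"name": "B. Respondent", "zip": "00001", "income": 48000, "vote": "no"},
]
id_table, research_table = split_identifiers(records, id_fields={"name", "zip"})
```

The identifying table would then be stored under stricter access control than the research table; only someone holding both can re-link them.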

Page 11: Linking Data to Publications through Citation and Virtual Archives

Data Management Elements

Page 12: Linking Data to Publications through Citation and Virtual Archives

Access and Sharing

Page 13: Linking Data to Publications through Citation and Virtual Archives

Organization and Documentation

Page 14: Linking Data to Publications through Citation and Virtual Archives

DMP Operational Issues

Page 15: Linking Data to Publications through Citation and Virtual Archives

Rights and Responsibilities

Page 16: Linking Data to Publications through Citation and Virtual Archives

Why is Infrastructure for Data Sharing Necessary?

Accessibility:
• Many large data sets: in public archives
• Most data in published articles: not accessible, results not replicable without the original author
• Most data sets from federal grants: not publicly available

Problems with discovery and linking even with professional archives:
• Data in different archives have different identifiers
• Archives change identifiers, links
• Changes to data are made; identifiers are reused or removed; old data are lost
• Locating/browsing/extracting requires specialized tools & approaches

Sharing data requires exposing tacit knowledge
• Explicit documentation of data structure, collection process, interpretation
• Harmonizing/linking to known ontologies, metadata schemas, vocabularies

Data sets are not preserved like books
• Static data files (even if on the web): unreadable after a few years
• When storage methods change: some data sets are lost; others have altered content!

Why not a single centralized infrastructure?
• Single point of failure
• Impossible when data are heterogeneous in format, origin, size, effort needed to collect or analyze, legal access rules, etc.
• Data producers want credit, control, and visibility

Page 17: Linking Data to Publications through Citation and Virtual Archives

Core Requirements for Data Sharing Infrastructure

• Stakeholder incentives: recognition; citation; payment; compliance; services
• Dissemination: access to metadata; documentation; data
• Access control: authentication; authorization; rights management
• Provenance: chain of control; verification of metadata, bits, semantic content
• Persistence: bits; semantic content; use
• Legal protection: rights management; consent; record keeping; auditing
• Usability: discovery; deposit; curation; administration; collaboration
• Business model

Sources: King 2007; ICSU 2004; NSB 2005
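
Illustration (not from the original slides): the "verification of ... bits" part of the provenance requirement is commonly handled with a fixity check, sketched below in Python using a SHA-256 digest from the standard library. The helper names and the sample bytes are assumptions made for this example. A bit-level digest changes whenever the file is re-encoded or reformatted, so it says nothing about semantic content, which the list treats as a separate item.

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()

def verify_fixity(data: bytes, recorded_digest: str) -> bool:
    """True if the bytes still match the digest recorded at deposit time."""
    return sha256_of(data) == recorded_digest

# Made-up data file contents
original = b"id,income\n1,52000\n2,48000\n"
recorded = sha256_of(original)            # stored alongside the deposit

# A single changed value is enough to fail the check
altered = original.replace(b"48000", b"48001")

ok_original = verify_fixity(original, recorded)
ok_altered = verify_fixity(altered, recorded)
```

Here `ok_original` is true and `ok_altered` is false: the check detects silent alteration, but it cannot say which copy is the correct one without a trusted recorded digest.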

Page 18: Linking Data to Publications through Citation and Virtual Archives

Data Citation as a Leverage Point

Services
• Identifiers to specific fixed versions of data are needed to establish unambiguous chains of provenance
• Identifiers that can be globally resolved to machine-understandable metadata and to the identified object are needed to build generalized access and analysis services
• Persistent identifiers are needed to maintain long-term access

Incentives
• Scholarly credit (intellectual attribution) is a large motivator for many researchers: citation creates an incentive for researchers to publish data
• Scholars also comply with enforceable journal policies: requiring data citation is a lightweight method to make data access policies auditable
• Impact/usage is a motivator for public research funders: data citation provides a foundation for measures of usage and impact
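
Illustration (not from the original slides): a minimal Python sketch of a data citation record that carries a persistent identifier and names one fixed version, as the services points above call for. The class, its fields, the rendering order, and every value are hypothetical; this is not the citation format of any particular archive or standard.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataCitation:
    """A citation to one fixed version of a data set (hypothetical structure)."""
    authors: str
    year: int
    title: str
    identifier: str        # persistent, globally resolvable identifier
    version: str           # names exactly one fixed version of the data
    fingerprint: str = ""  # optional content fingerprint for verification

    def render(self) -> str:
        text = (
            f'{self.authors}. {self.year}. "{self.title}." '
            f"{self.identifier}, version {self.version}"
        )
        if self.fingerprint:
            text += f", {self.fingerprint}"
        return text + "."

# All values below are placeholders, not real identifiers
citation = DataCitation(
    authors="A. Author",
    year=2011,
    title="Example Survey Data",
    identifier="hdl:0000/EXAMPLE",
    version="2",
    fingerprint="FP:EXAMPLE",
)
rendered = citation.render()
```

Because the record is immutable and the version is part of it, a later revision of the data yields a different citation rather than silently changing what an existing citation points to.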

Page 19: Linking Data to Publications through Citation and Virtual Archives

Emerging Practices for Data Citation

• Publishers
• Data archives
• Standards bodies
• Librarians
• Discipline-specific standards

Page 20: Linking Data to Publications through Citation and Virtual Archives

ORCID Participant Meeting: Data Citations and The DataVerse Network (R)

Common Principles

Page 21: Linking Data to Publications through Citation and Virtual Archives

Thanks to 37 Participants

Page 22: Linking Data to Publications through Citation and Virtual Archives

Seven Ways of Looking at Data

^Supplementary

AKA

Page 23: Linking Data to Publications through Citation and Virtual Archives

Theory

Page 24: Linking Data to Publications through Citation and Virtual Archives

Theory +

Data citations should be first-class objects for publication: they should appear with the other citations and be as easy to cite as other works

At minimum, all data necessary to understand, assess, and extend the conclusions of a scholarly work should be cited

Citations should persist and enable access to a fixed version of the data for at least as long as the citing work

Data citation should support unambiguous attribution of credit to all contributors, possibly through the citation ecosystem

Page 25: Linking Data to Publications through Citation and Virtual Archives

Theory + Practice

Page 26: Linking Data to Publications through Citation and Virtual Archives

Use Cases

Page 27: Linking Data to Publications through Citation and Virtual Archives

Use Cases

Page 28: Linking Data to Publications through Citation and Virtual Archives

Dataverse

For Scholars
• Brand it like your own website.
• Upload any type of data.
• Establish a persistent data citation
• Facilitate data discovery
• Provide live analysis
• Receive permanent storage space

For Organizations
• Used by archives, libraries, journals, schools
• Enable contributors to upload data
• Organize studies by collections
• Search across a universe of data
• Control access and terms of use
• Federate with catalogs and partners: OAI-PMH, LOCKSS, Z39.50, DDI
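
Illustration (not from the original slides): federation over OAI-PMH works by issuing HTTP requests with a `verb` parameter and reading back XML. The Python sketch below builds a `ListIdentifiers` request URL and parses a trimmed sample response. The base URL and record identifiers are made up, the sample omits elements a real response would carry (such as the response date and request echo), and nothing here is specific to how Dataverse exposes its endpoint.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 response documents
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def list_identifiers_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListIdentifiers request URL for a harvest."""
    query = urlencode({"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"

def parse_identifiers(xml_text):
    """Extract record identifiers from a ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]

# Trimmed, made-up response used instead of a live network call
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:example.org:study-1</identifier>
      <datestamp>2011-01-01</datestamp>
    </header>
    <header>
      <identifier>oai:example.org:study-2</identifier>
      <datestamp>2011-02-01</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""

url = list_identifiers_url("https://archive.example.org/oai")  # hypothetical endpoint
identifiers = parse_identifiers(SAMPLE_RESPONSE)
```

A harvesting catalog would fetch `url`, collect the identifiers, and then request each record's metadata, which is what lets several archives appear as one searchable catalog.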

Page 29: Linking Data to Publications through Citation and Virtual Archives

Federated Archive: National Research Portal

• Preserve data
• Provide access to local data
• Organize universe of data

Page 30: Linking Data to Publications through Citation and Virtual Archives

Virtual Archive: Library Catalog & Repository

• Pathfinder
• Virtual collection
• Manages licensed works
• Institutional repository

Page 31: Linking Data to Publications through Citation and Virtual Archives

Federated + Virtual: Data-PASS Union Catalog

Data-PASS uses Dataverse:
• Creates federated catalog
• Manages content for some partners
• Provides simple way for organizations to participate in partnership

Data-PASS uses SafeArchive:
• Collaboration through mutual replication of partner content
• Supports legal transfer agreements

SafeArchive + LOCKSS + Dataverse = Policy-based replicated data archives
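
Illustration (not from the original slides): "policy-based" replication can be read as checking holdings against a stated rule, such as a minimum number of agreeing copies per collection. The Python sketch below audits a made-up holdings table against such a rule. It is a hypothetical illustration only: the function, data layout, and policy are assumptions, and it does not reproduce the interfaces or algorithms of SafeArchive or LOCKSS.

```python
def audit_replication(holdings, min_copies):
    """Return collections that fail a minimum-agreeing-copies policy.

    holdings maps collection name -> {site name: content digest}.
    Copies count toward the policy only if their digests agree, so a
    corrupted replica does not count as a good copy.
    """
    failures = {}
    for collection, copies in holdings.items():
        sites_by_digest = {}
        for site, digest in copies.items():
            sites_by_digest.setdefault(digest, []).append(site)
        # Size of the largest group of sites holding identical content
        agreeing = max((len(s) for s in sites_by_digest.values()), default=0)
        if agreeing < min_copies:
            failures[collection] = agreeing
    return failures

# Made-up holdings reported by three partner sites
holdings = {
    "survey-a": {"site1": "d1", "site2": "d1", "site3": "d1"},
    "survey-b": {"site1": "d2", "site2": "XX", "site3": "d2"},  # one damaged copy
    "survey-c": {"site1": "d3"},                                 # under-replicated
}
failures = audit_replication(holdings, min_copies=3)
```

With a three-copy policy, `survey-b` and `survey-c` are reported with two and one agreeing copies respectively, which is the kind of result an archive partnership could act on by re-replicating content.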

Page 32: Linking Data to Publications through Citation and Virtual Archives

Journal Replication Archives

Support publication workflows

Permanent, branded supplementary materials repository

Treats data as first class objects – provides identifiers and services

Page 33: Linking Data to Publications through Citation and Virtual Archives

Virtual Archive: Scholar Site

Scholar retains control over branding and dissemination

Preservation and long-term access are guaranteed

Dissemination and compliance with Data Management Plans are verifiable

Integrates with OpenScholar

Page 34: Linking Data to Publications through Citation and Virtual Archives

Dataverse Network – Designed for Research Data

Page 35: Linking Data to Publications through Citation and Virtual Archives

Summary

Data management includes:
• Safeguarding data for use
• Protecting rights and confidentiality
• Short-term and long-term dissemination
• Many technical issues
• Many institutional models

Citation provides leverage for incentives and services
• Data citations should be first-class objects for publication: they should appear with the other citations and be as easy to cite as other works
• At minimum, all data necessary to understand, assess, and extend the conclusions of a scholarly work should be cited
• Citations should persist and enable access to a fixed version of the data for at least as long as the citing work
• Data citation should support unambiguous attribution of credit to all contributors, possibly through the citation ecosystem

Virtual archiving is one successful model for publication-related data archiving and primary data archiving

Page 36: Linking Data to Publications through Citation and Virtual Archives

Contact Us

Micah Altman

maltman.hmdc.harvard.edu

The Dataverse Network ™

thedata.org