Linking Data to Publications through Citation and Virtual Archives Micah Altman, Institute for Quantitative Social Science, Harvard University Prepared for the 2011 SSP 33rd Annual Meeting June 2011
May 06, 2015
Collaborators*
Leonid Andreev, Ed Bachman, Adam Buchbinder, Ken Bollen, Bryan Beecher, Steve Burling, Kevin Condon, Jonathan Crabtree, Merce Crosas, Gary King, Patrick King, Tom Lipkis, Freeman Lo, Jared Lyle, Marc Maynard, Nancy McGovern, Lois Timms-Ferrarra, Akio Sone, Bob Treacy
Research Support
Thanks to the Library of Congress (PA#NDP03-1), the National Science Foundation (DMS-0835500, SES 0112072), IMLS (LG-05-09-0041-09), the Harvard University Library, the Institute for Quantitative Social Science, the Harvard-MIT Data Center, and the Murray Research Archive.
* And co-conspirators
Related Work
Reprints available from: http://maltman.hmdc.harvard.edu
M. Altman, M. Adams, J. Crabtree, D. Donakowski, M. Maynard, A. Pienta, and C. Young. 2009. "Digital Preservation through Archival Collaboration: The Data Preservation Alliance for the Social Sciences." The American Archivist 72(1): 169-182.
M. Altman and G. King. 2007. "A Proposed Standard for the Scholarly Citation of Quantitative Data." D-Lib Magazine 13(3/4) (March/April).
M. Altman. 2008. "A Fingerprint Method for Verification of Scientific Data." In Advances in Systems, Computing Sciences and Software Engineering (Proceedings of the International Conference on Systems, Computing Sciences and Software Engineering 2007). Springer Verlag.
M. Crosas. 2011. "The Dataverse Network: An Open-Source Application for Sharing, Discovering and Preserving Data." D-Lib Magazine 17(1/2).
G. King. 2007. "An Introduction to the Dataverse Network as an Infrastructure for Data Sharing." Sociological Methods and Research 32(2): 173-199.
Roadmap
Motivations
Elements of data management
Citing Data
Virtual Archives
Data Access is Key to Science
Science is not (only) about being scientific
Scientific progress requires community: competition and collaboration in the pursuit of common goals
Without access to the same materials, no community exists
… data is the nucleus of scientific collaboration
The value of an article that can’t be replicated: ?
Scholarly articles are summaries, not the actual research results
Experimental data are expensive to reproduce; observational data are impossible to reproduce
Data are hard for journal editors to verify
If you find the data, how do you know it’s the same?
Replication projects show: many published articles cannot be replicated
… data is needed for scientific replication
Data is Key to Democracy
Statistics = state-istics
The state tax authority: counting people, estimating wealth
Reformers use data to assess the performance of the state
Science informs public policy continually
In modern democracy: the public needs a direct source of information
Source: “Propaganda” http://www.media-studies.ca/articles/images/berlin_wall.jpg
Open Data Enables New Forms of Science and Education
Data Intensive Science
  Increased opportunities for interdisciplinarity
  Science modeling reality across multiple scales
  Continuous, complete, fine-grained information on physical processes, systems, human behavior
Open Data Democratizes Science
  Citizen-scientist
  Developing countries
  Institutions outside of the inner circle of research
Education
  Open data eases transition from education to research
In addition, sharing data increases citation rates [Gleditsch 2003; Wilson 2008; Piwowar 2007]
Visualization from multiple experiments using Community Climate Systems Model, through Earth Science Grid. Source: “Beyond Being There”, National Science Foundation, 2008.
Science Model
“Unpublished data and personal communications: Citations to unpublished data and personal communications cannot be used to support claims in a published paper.”
“Data and materials availability: All data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science.”
Some Formal Requirements
NIH: The Final NIH Statement on Sharing Research Data was published in the NIH Guide on February 26, 2003. “Starting with the October 1, 2003 receipt date, investigators submitting an NIH application seeking $500,000 or more in direct costs in any single year are expected to include a plan for data sharing or state why data sharing is not possible.”
  Data are to be shared no later than the acceptance for publication of the main findings from the final data set
NSF: All proposals must (as of 1/1/2011) include a data management plan. Specific requirements are vague, for the most part: they “will be determined by the community of interest through the process of peer review and program management.”
Wellcome Trust: “will review data management and sharing plans, and any costs involved in delivering them, as an integral part of the funding decision”
Data Management Plans
Safeguarding data for internal use in project
  Documentation
  Backup and recovery
  Review
Treatment of confidential and rights-encumbered information
  Consent to disclosure
  Overview: http://www.icpsr.org/DATAPASS/pdf/confidentiality.pdf
  Separation of identifying and sensitive information
  Obtain certificate of confidentiality, other legal safeguards
  De-identification and public use files
  Licensing
Dissemination
  Archiving commitment (include letter of support)
  Archiving timeline
  Access procedures
  Documentation
  User vetting, tracking, and support
  Licenses and restrictions
Data Management Elements
Access and Sharing
Organization and Documentation
DMP Operational Issues
Rights and Responsibilities
Why is Infrastructure for Data Sharing Necessary?
Accessibility:
  Many large data sets: in public archives
  Most data in published articles: not accessible, results not replicable without the original author
  Most data sets from federal grants: not publicly available
Problems with discovery and linking even with professional archives:
  Data in different archives have different identifiers
  Archives change identifiers, links
  Changes to data are made; identifiers are reused or removed; old data are lost
  Locating/browsing/extracting requires specialized tools & approaches
Sharing data requires exposing tacit knowledge
  Explicit documentation of data structure, collection process, interpretation
  Harmonizing/linking to known ontologies, metadata schemas, vocabularies
Data sets are not preserved like books
  Static data files (even if on the web): unreadable after a few years
  When storage methods change: some data sets are lost; others have altered content!
Why not a single centralized infrastructure?
  Single point of failure
  Impossible when data are heterogeneous in format, origin, size, effort needed to collect or analyze, legal access rules, etc.
  Data producers want credit, control, and visibility
Core Requirements for Data Sharing Infrastructure
Stakeholder incentives: recognition; citation; payment; compliance; services
Dissemination: access to metadata; documentation; data
Access control: authentication; authorization; rights management
Provenance: chain of control; verification of metadata, bits, semantic content
Persistence: bits; semantic content; use
Legal protection: rights management; consent; record keeping; auditing
Usability: discovery; deposit; curation; administration; collaboration
Business model
Sources: King 2007; ICSU 2004; NSB 2005
Data Citation as a Leverage Point
Services
  Identifiers to specific fixed versions of data are needed to establish unambiguous chains of provenance
  Identifiers that can be globally resolved to machine-understandable metadata and to the identified object are needed to build generalized access and analysis services
  Persistent identifiers are needed to maintain long-term access
Incentives
  Scholarly credit (intellectual attribution) is a large motivator for many researchers: citation creates an incentive for researchers to publish data
  Scholars also comply with enforceable journal policies: requiring data citation is a lightweight method to make data access policies auditable
  Impact/usage is a motivator for public research funders: data citation provides a foundation for measures of usage and impact
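The link between a citation and one fixed version of the data can be illustrated with a small sketch. Everything here is an assumption made for illustration: the record fields, the placeholder identifier, and the use of a SHA-256 digest over raw bytes as a stand-in fingerprint. A production fingerprint method (such as the one listed under Related Work) would likely normalize the data first so that the result does not depend on file format.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCitation:
    """Illustrative citation record; the field names are assumptions."""
    authors: str
    year: int
    title: str
    identifier: str   # persistent, globally resolvable ID (placeholder below)
    fingerprint: str  # digest tied to one fixed version of the data

    def render(self) -> str:
        return (f'{self.authors}. {self.year}. "{self.title}." '
                f'{self.identifier}; fingerprint:{self.fingerprint}')


def fingerprint(data: bytes) -> str:
    """Stand-in fingerprint: SHA-256 of the raw bytes, hex-encoded."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def verify(citation: DataCitation, data: bytes) -> bool:
    """True only if the bytes in hand are the exact version that was cited."""
    return fingerprint(data) == citation.fingerprint


v1 = b"id,income\n1,52000\n2,61000\n"
v2 = b"id,income\n1,52000\n2,61500\n"  # a later correction to the data

cite = DataCitation(
    authors="A. Author",
    year=2011,
    title="Example Survey Data",
    identifier="hdl:0000/EXAMPLE",  # placeholder, not a real handle
    fingerprint=fingerprint(v1),
)

print(cite.render())
print(verify(cite, v1))  # checks the cited version
print(verify(cite, v2))  # checks the corrected version
```

Because the fingerprint travels with the citation, a reader who obtains the data from any source can check whether it is the version the article actually used.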
Emerging Practices for Data Citation
Publishers
Data archives
Standards bodies
Librarians
Discipline-specific standards
ORCID Participant Meeting: Data Citations and The DataVerse Network (R)
Common Principles
Thanks to 37 Participants
Seven Ways of Looking at Supplementary Data
AKA
Theory
Theory +
Data citations should be first class objects for publication: they should appear with other citations and be as easy to cite as other works
At minimum, all data necessary to understand, assess, and extend the conclusions in a scholarly work should be cited
Citations should persist and enable access to a fixed version of the data for at least as long as the citing work exists
Data citation should support unambiguous attribution of credit to all contributors, possibly through the citation ecosystem
Theory + Practice
Use Cases
Use Cases
For Scholars
• Brand it like your own website.
• Upload any type of data.
• Establish a persistent data citation
• Facilitate data discovery
• Provide live analysis
• Receive permanent storage space
For Organizations
• Used by archives, libraries, journals, schools
• Enable contributors to upload data
• Organize studies by collections
• Search across a universe of data
• Control access and terms of use
• Federate with catalogs and partners: OAI-PMH, LOCKSS, Z39.50, DDI
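As a rough illustration of what federating a catalog over OAI-PMH involves, the sketch below builds the harvest request a partner catalog would send. The base URL and set name are hypothetical; `verb`, `metadataPrefix`, and `set` are standard OAI-PMH request arguments, and `oai_dc` is the metadata format every OAI-PMH repository is required to support. No network call is made.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     set_spec: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        # Restrict the harvest to one collection exposed as an OAI set.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and set name, used only for illustration.
BASE = "https://archive.example.org/oai"
print(list_records_url(BASE))
print(list_records_url(BASE, set_spec="demo_collection"))
```

A harvester would fetch these URLs periodically and merge the returned metadata records into its own catalog, which is how a union catalog can list studies it does not itself hold.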
Dataverse
Federated Archive: National Research Portal
Preserve data
Provide access to local data
Organize universe of data
Virtual Archive: Library Catalog & Repository
Pathfinder
Virtual collection
Manages licensed works
Institutional repository
Federated + Virtual: Data-PASS Union Catalog
Data-PASS uses Dataverse:
  Creates federated catalog
  Manages content for some partners
  Provides simple way for organizations to participate in partnership
Data-PASS uses SafeArchive:
  Collaboration through mutual replication of partner content
  Supports legal transfer agreements
SafeArchive + LOCKSS + Dataverse = Policy based replicated data archives
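One way to picture a policy-based replicated archive is as an audit that compares partner holdings against a stated policy. The sketch below is not SafeArchive's or LOCKSS's actual interface; the `Replica` record, the `audit` function, and the rule "at least `min_copies` distinct partners hold a copy whose checksum matches the reference" are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Replica:
    host: str       # partner holding the copy
    checksum: str   # digest the partner reports for its copy


def audit(collections: Dict[str, List[Replica]],
          reference: Dict[str, str],
          min_copies: int) -> Dict[str, bool]:
    """Report, per collection, whether enough distinct hosts hold a valid copy."""
    result = {}
    for name, replicas in collections.items():
        # Count each host once, and only if its copy matches the reference digest.
        good_hosts = {r.host for r in replicas if r.checksum == reference[name]}
        result[name] = len(good_hosts) >= min_copies
    return result


# Made-up holdings: one replica of survey-b is damaged.
reference = {"survey-a": "abc123", "survey-b": "def456"}
holdings = {
    "survey-a": [Replica("partner1", "abc123"), Replica("partner2", "abc123"),
                 Replica("partner3", "abc123")],
    "survey-b": [Replica("partner1", "def456"), Replica("partner2", "000000"),
                 Replica("partner3", "def456")],
}

report = audit(holdings, reference, min_copies=3)
print(report)
```

Expressing the policy as data (here, a single `min_copies` threshold) is what makes compliance checkable by every partner rather than a matter of trust.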
Journal Replication Archives
Support publication workflows
Permanent, branded supplementary materials repository
Treats data as first class objects – provides identifiers and services
Virtual Archive: Scholar Site
Scholar retains control over branding and dissemination
Preservation and long-term access are guaranteed
Dissemination and compliance with Data Management Plans are verifiable
Integrates with OpenScholar
Dataverse Network – Designed for Research Data
Summary
Data management includes
  Safeguarding data for use
  Protecting rights and confidentiality
  Short term and long term dissemination
  Many technical issues
  Many institutional models
Citation provides leverage for incentives and services
  Data citations should be first class objects for publication: they should appear with other citations and be as easy to cite as other works
  At minimum, all data necessary to understand, assess, and extend the conclusions in a scholarly work should be cited
  Citations should persist and enable access to a fixed version of the data for at least as long as the citing work exists
  Data citation should support unambiguous attribution of credit to all contributors, possibly through the citation ecosystem
Virtual archiving is one successful model for publication related data archiving and primary data archiving
Contact Us
Micah Altman
maltman.hmdc.harvard.edu
The Dataverse Network ™
thedata.org