The Expanding Dataverse Mercè Crosas, Director of Data Science, IQSS @mercecrosas January 21, 2015, Lamont Library, Harvard University
Transcript
Page 1: The expanding dataverse

The Expanding Dataverse

Mercè Crosas, Director of Data Science, IQSS, @mercecrosas

January 21, 2015, Lamont Library, Harvard University

Page 2: The expanding dataverse

Data Publishing: A form of Scholarly Communication

350 years of scientific publishing, with words and data

1665: Data, if any, were part of the printed publication

Now: Vast quantities of digital data (and code) cannot be part of the printed publication

Page 3: The expanding dataverse

Pillars of Data Publishing

To make data discoverable, accessible and reusable, we need:

1. Data Citation, to reference and find data
2. Data Repositories, to host and access data
3. Information about the data, to understand and reuse them

Page 4: The expanding dataverse

Dataverse Software: A Data Publishing framework

… for a wide range of repositories

Public, Generic Repositories

Institutional Repositories

Curated Data Archives Repositories

Page 5: The expanding dataverse

http://dataverse.org

Page 6: The expanding dataverse

Dataverse 4.0: Enables and Enhances Data Publishing

● A data citation compliant with the Data Citation Principles
● Rich metadata to describe and find datasets from multiple domains
● Support for public and restricted data, open data license and terms of use
● Rigorous workflows to publish data, with support for new versions of the data

Page 7: The expanding dataverse

Data Citation

Page 8: The expanding dataverse

A Brief History of Citing Data

1906: Chicago Manual of Style: author/creator, title, dates, publisher or distributor

1959: First scientific digital repositories (e.g., World Data Center, ICPSR)

1979: ASBR (“Data File” type); MARC (machine readable catalog); Domain Repositories (e.g., GenBank)

1999 - Now: Growth of Data Repositories (e.g., NESSTAR, Dataverse, Dryad, Figshare, Zenodo); DOI services for Data (e.g., DataCite in 2009)

2014: Data Citation Principles; NISO-JATS revised to support data

Altman & Crosas, 2013, “The Evolution of Data Citation: From Principles to Implementation”, IASSIST Quarterly

Page 9: The expanding dataverse

Joint Declaration of Data Citation Principles

1 Importance
2 Credit and Attribution
3 Evidence
4 Unique Identification
5 Access
6 Persistence
7 Specificity and Verifiability
8 Interoperability and Flexibility

https://www.force11.org/datacitation

Page 10: The expanding dataverse

Data Citation generated by Dataverse

Citation elements: Authors, Year, Dataset Title, DOI, Data Repository, UNF, version

Principle 2: Credit and Attribution

Principles 4, 5, 6: Unique Identification, Access, Persistence. Resolves to landing page with access to metadata, docs, and data

Principle 7: Specificity and Verifiability

Principle 8: Interoperability and Flexibility. Repository exports citation metadata in XML, JSON formats

Altman & King, 2007. A Proposed Standard for the Scholarly Citation of Quantitative Data.
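
As a rough illustration of the citation elements on this slide, the sketch below assembles them into one citation string and a JSON export. The class name, the field order, the separators, and every sample value (including the DOI and UNF) are assumptions made for the example; this is not the format Dataverse itself emits.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class DatasetCitation:
    """Hypothetical container for the citation elements named on the slide."""
    authors: List[str]
    year: int
    title: str
    doi: str          # persistent identifier (sample value is made up)
    repository: str
    version: str
    unf: str = ""     # Universal Numerical Fingerprint, optional

    def format(self) -> str:
        # Assumed layout: Authors, Year, "Title", DOI link, Repository, Version, UNF
        parts = [
            "; ".join(self.authors),
            str(self.year),
            f'"{self.title}"',
            f"https://doi.org/{self.doi}",
            self.repository,
            f"V{self.version}",
        ]
        if self.unf:
            parts.append(self.unf)
        return ", ".join(parts)

    def to_json(self) -> str:
        # Machine-readable export of the same fields (Principle 8).
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    citation = DatasetCitation(
        authors=["Doe, Jane", "Roe, Richard"],
        year=2015,
        title="Example Survey Data",
        doi="10.1234/EXAMPLE",
        repository="Example Dataverse",
        version="1",
        unf="UNF:6:EXAMPLE==",
    )
    print(citation.format())
    print(citation.to_json())
```

Keeping the fields in one structure means the human-readable citation and the exported metadata cannot drift apart.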

Page 11: The expanding dataverse

A rigorous

Metadata

Page 12: The expanding dataverse
Page 13: The expanding dataverse

Three Metadata Levels

Generic Metadata

Includes data citation metadata fields (Examples: title, authors, persistent id, description)

Domain Specific Metadata

Examples:
● Social Science Metadata (DDI)
● Life Sciences (ISA-Tab)
● Astronomy (VO)

File Metadata

Examples (automatic):
● For Tabular Files: Column information
● For FITS Files: Header information
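
One way to picture the three levels is a record with a generic block, optional domain-specific blocks, and per-file metadata derived automatically from the file. The sketch below is illustrative only: the class, the function, the key names, and the sample file are assumptions, and it reads column information from delimited text with Python's standard `csv` module rather than using any Dataverse component.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Dict


def tabular_file_metadata(text: str, delimiter: str = ",") -> Dict[str, object]:
    """Derive column information from the header row of delimited text."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader, [])
    row_count = sum(1 for _ in reader)  # remaining rows are data rows
    return {"columns": header, "column_count": len(header), "row_count": row_count}


@dataclass
class DatasetMetadata:
    """Hypothetical record holding the three metadata levels."""
    generic: Dict[str, object]                                           # citation fields
    domain: Dict[str, Dict[str, object]] = field(default_factory=dict)  # keyed by domain
    files: Dict[str, Dict[str, object]] = field(default_factory=dict)   # keyed by file name


if __name__ == "__main__":
    record = DatasetMetadata(
        generic={"title": "Example Survey Data", "authors": ["Doe, Jane"]},
        domain={"social_science": {"unit_of_analysis": "household"}},
    )
    sample = "id,age,income\n1,34,52000\n2,51,61000\n"
    record.files["survey.csv"] = tabular_file_metadata(sample)
    print(record.files["survey.csv"])
```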

Page 14: The expanding dataverse

Life Science Metadata

Example: Life Sciences Metadata

Page 15: The expanding dataverse

Example: Astronomy Metadata

Page 16: The expanding dataverse

Public vs Restricted

Page 17: The expanding dataverse

Terms, Licenses and Restrictions

Public Dataset
● CC0 License
● Metadata is public
● Files are public

Dataset with Restricted Files
● CC0 License
● Metadata is public
● Files are restricted
● Access Terms are defined in dataset

Dataset with Terms of Use
● Metadata is public
● Terms of Use are defined in dataset (CC0 can’t apply)
● Files might be public or restricted
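
The three configurations on this slide can be read as a small decision rule: custom Terms of Use rule out CC0, and otherwise the dataset is CC0 with either public or restricted files. The sketch below encodes that reading; the class and attribute names are hypothetical and do not come from the Dataverse code base.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetTerms:
    """Hypothetical model of the three configurations on the slide."""
    has_restricted_files: bool
    custom_terms_of_use: Optional[str] = None

    @property
    def license(self) -> Optional[str]:
        # CC0 cannot apply once custom Terms of Use are defined.
        return None if self.custom_terms_of_use else "CC0"

    @property
    def metadata_public(self) -> bool:
        # Metadata is public in all three configurations.
        return True

    def category(self) -> str:
        if self.custom_terms_of_use:
            return "Dataset with Terms of Use"
        if self.has_restricted_files:
            return "Dataset with Restricted Files"
        return "Public Dataset"


if __name__ == "__main__":
    for terms in (
        DatasetTerms(has_restricted_files=False),
        DatasetTerms(has_restricted_files=True),
        DatasetTerms(has_restricted_files=True, custom_terms_of_use="Cite the authors"),
    ):
        print(terms.category(), "| license:", terms.license)
```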

Page 18: The expanding dataverse

Workflows

Page 19: The expanding dataverse

Draft, Published and Versions

Upload Data

Draft Dataset: Dataset in review, can be shared with collaborators

Published Dataset, v1: Once published, dataset cannot be unpublished (only deaccessioned). Data Citation becomes public

Draft

Published Dataset, v1.1: Minor version for small changes to dataset description. Data Citation doesn’t change

Draft

Published Dataset, v2: Major version for new versions of data files. Data Citation changes
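
The versioning rule on this slide (a minor version for description changes, a major version for new data files, with the citation changing only on a major version) can be sketched as follows. The type and function names are hypothetical, chosen only to illustrate the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    major: int
    minor: int = 0

    def __str__(self) -> str:
        # Matches the slide's labels: v1, v1.1, v2
        return f"v{self.major}" if self.minor == 0 else f"v{self.major}.{self.minor}"


def next_version(current: Version, files_changed: bool) -> Version:
    """Return the version assigned when a draft is published."""
    if files_changed:
        return Version(current.major + 1, 0)          # new data files: major version
    return Version(current.major, current.minor + 1)  # description only: minor version


def citation_changes(old: Version, new: Version) -> bool:
    """Per the slide, the data citation changes only with a major version."""
    return new.major != old.major


if __name__ == "__main__":
    v1 = Version(1)
    v1_1 = next_version(v1, files_changed=False)
    v2 = next_version(v1_1, files_changed=True)
    print(v1, v1_1, v2)
    print(citation_changes(v1, v1_1), citation_changes(v1_1, v2))
```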

Page 20: The expanding dataverse

Multiple Roles for Multiple Workflows

Editor: Upload Data + Edit Metadata

Manager: Upload Data + Edit Metadata + Set File Restrictions + License and Terms

Curator: Upload Data + Edit Metadata + Set File Restrictions + License and Terms + Grant Access + Publish Dataset

+ Custom Roles

Page 21: The expanding dataverse

Data Processing, Analysis, and Visualizations

Page 22: The expanding dataverse

Tabular Data: Converted to Preservation format

Download in Original format or Preservation format (does not depend on software package)

Page 23: The expanding dataverse

Tabular Data: Explore and Analyze with TwoRavens

Page 24: The expanding dataverse

Geospatial Data: Visualize in WorldMap

Page 25: The expanding dataverse

Demo acknowledgement: Dwayne Liburd, Sonia Barbosa

Page 26: The expanding dataverse

Not only Expanding in Features, but also in Size

874 Dataverses

55,539 Datasets

1,173,733 Downloads

Page 27: The expanding dataverse

What’s coming

Page 28: The expanding dataverse
Page 29: The expanding dataverse

Beyond 4.0

● Integration with other Systems:
o DASH
o ORCID
o Journal Systems (in addition to OJS)
o Archivematica
o iRODS

● Support for Sensitive Data:
o Secure Storage
o DataTags
o Analysis with Privacy Preserving Algorithms

● Data Citation with Dataset Provenance

● Expanding APIs!

Page 30: The expanding dataverse
Page 31: The expanding dataverse


Thank You

[email protected]

@mercecrosas

http://datascience.iq.harvard.edu/team