Versioning of Data Sets: Why, How, What and Where? Lesley Wyborn, NCI; Jens Klump, CSIRO; Adrian Burton, ANDS. These slides are available at: http://bit.ly/2dDmXHE

Adrian Burton, ANDS Versioning of Data Sets: Why, How ... · eResearch Australasia 2016: BoF on Versioning ... • most RDA’s recommendations are achievable for small datasets •

Apr 20, 2018


hoàng_Điệp
Transcript
Page 1

Versioning of Data Sets: Why, How, What and Where?

Lesley Wyborn, NCI

Jens Klump, CSIRO

Adrian Burton, ANDS

These slides are available at: http://bit.ly/2dDmXHE

Page 2

Outline

● Group introductions (Lesley - facilitator, ~2 mins)
● Why - the growing need for data versioning (Lesley, 5 mins)
● How and what (Jens, 5 mins)
● Where (current practices) (Adrian, 5 mins)
● Two case studies
  ○ AAL (Yeshe, 5 mins)
  ○ IMOS (Natalia, 5 mins)
● Group discussions (Lesley/Jens, 10 mins, Flipchart - Adrian)
● Summary and next steps (Jens, 5 mins)

Page 3

© National Computational Infrastructure 2016

Why: the Growing Need for Data Versioning

Lesley Wyborn
National Computational Infrastructure, ANU

Page 4

Why is Versioning Important in eResearch today?

• Historically we have moved on from the traditional ‘book on the shelf model’ for datasets:

– A single researcher/individual research team collected all data used in each research paper

– Versioning was straightforward: there was only one data set that was unique to that paper

eResearch Australasia 2016: BoF on Versioning

Source: http://commons.wikimedia.org/wiki/File:Shelves_of_Language_Books_in_Library.JPG
Source: http://en.wikipedia.org/wiki/Library_catalog#/media/File:Schlagwortkatalog.jpg

Page 5

We have moved on to the era of sharing and reusing data

• Team-based research is now more common and often uses publicly available data sets that can be reused or repurposed to support new research directions

• Data can be sourced from national/institutional repositories (49 PB in Australia in RDS):

– Many of these data sets are being continually added to and/or revised

– Data are being copied between repositories or to local sites

– Colocation with HPC/cloud means new data products can be created in very short time frames

• Some of the more mature data centers offer web services access:

– Enables users to dynamically select subsets based on spatial and/or temporal queries

– Rarely are queries the same, particularly where the user graphically draws a spatial bounding box to define the area of interest

Page 6

Modern eResearch Data Conundrum

• It is now much harder for a researcher to cite the exact data extract that was used to support a research project or government investigation

– Particularly if the source data set is being dynamically modified

– And/or data are accessed via dynamic queries

• With increasing duplication of data sets between the main data repositories and local stores, it is also getting harder to identify the canonical, or point-of-truth, data set

Source: http://generator-meme.com/meme/sad-tiger/

Page 7

Basic versioning requirements

• Agreed procedures for versioning data sets and derived data products in a systematized way so that it is possible to reference the exact version of the data that was used:

a. to underpin the research findings and/or

b. to generate higher level data products

• When do we attach persistent identifiers to datasets?

a. Do we associate persistent identifiers with particular versions of each data set?

b. If a data set is constantly changing, how do we determine when to declare a new version?

c. Who assigns the PID when the data are generated and stored on a 3rd party site?


Source: http://www.fanpop.com/clubs/save-the-tigers/images/8696291/title/tigers-wallpaper

Page 8

Satellite Data Use Case


• 857,000 Landsat source scenes in this data set (~52 x 10^12 pixels)

• It is constantly being added to

• Historical errors found in a few scenes (going back >30 years)

• New data products are constantly being derived, and older versions are being revised

Page 9

I AM THE ONE assigning THE DOI to THIS data set

Page 10

Versioning: next steps?

• Do we agree that we need to be concerned about versioning or do we put it on ice?

• How should we approach versioning – Jens?


Source: http://www.latimes.com/world/la-fg-c1-china-siberian-tiger-20131001-dto-htmlstory.html

Page 11

Where?

Page 12

Software, e.g. 2.1.5

● Major: not backward compatible
  ○ Minor: backward compatible new functionality
    ■ Patch: backward compatible bug fix

Social conventions to a point
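
The major/minor/patch convention for software can be sketched in a few lines of Python. This is only an illustration of the convention described on the slide; the `Version` class and its method names are assumptions made for this example, not part of any tool mentioned in the talk.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """Hypothetical three-part software version such as 2.1.5."""
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "Version":
        # "2.1.5" -> Version(2, 1, 5)
        major, minor, patch = (int(part) for part in text.split("."))
        return cls(major, minor, patch)

    def bump(self, change: str) -> "Version":
        if change == "major":   # not backward compatible
            return Version(self.major + 1, 0, 0)
        if change == "minor":   # backward compatible new functionality
            return Version(self.major, self.minor + 1, 0)
        if change == "patch":   # backward compatible bug fix
            return Version(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown change type: {change}")

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


current = Version.parse("2.1.5")
print(current.bump("patch"))  # 2.1.6
print(current.bump("minor"))  # 2.2.0
print(current.bump("major"))  # 3.0.0
```

Because the version is a tuple, ordinary comparison orders releases correctly, so `Version.parse("2.1.5") < Version.parse("2.2.0")` holds.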

Page 13

Data?

● Any significant change

v.1, v.2

Page 14

Data?

● Major: change of scope, context or intended use of data
  ○ Minor: QA updates

1.2
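
As a rough sketch, the two-level scheme for data could be applied as follows. The change labels (`scope`, `context`, `intended_use`, `qa_update`) and the function name are hypothetical, and resetting the minor number to 0 on a major change is an assumption borrowed from the software convention rather than something stated on the slide.

```python
# Hypothetical labels for the kinds of change named on the slide.
MAJOR_CHANGES = {"scope", "context", "intended_use"}
MINOR_CHANGES = {"qa_update"}


def bump_data_version(version: str, change: str) -> str:
    """Return the next 'major.minor' data version for a given kind of change."""
    major, minor = (int(part) for part in version.split("."))
    if change in MAJOR_CHANGES:
        return f"{major + 1}.0"          # assumed: minor resets on a major change
    if change in MINOR_CHANGES:
        return f"{major}.{minor + 1}"
    raise ValueError(f"unknown change type: {change}")


print(bump_data_version("1.2", "qa_update"))  # 1.3
print(bump_data_version("1.2", "scope"))      # 2.0
```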

Page 15

Data? NASA

● Level 0: ‘raw’ data from the satellite.
● Level 1: calibrated and geolocated, with original sampling pattern.
● Level 2: converted into geophysical parameters with the original sampling pattern.
● Level 3: resampled, averaged over space, interpolated/averaged over time.

Page 16

Data? AIMS Weather Station Data

● Level 0: raw unprocessed data as received from the AWS. No QA.
● Level 1: all suspect data points removed but no suspect data are corrected.
● Level 2: all suspect data points corrected where possible.
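
The three weather-station levels can be illustrated with a small Python sketch. The data layout (a list of value/suspect-flag pairs) and the correction rule (averaging the two neighbouring readings) are assumptions chosen for this example; the slide does not say how AIMS actually flags or corrects suspect points.

```python
from typing import List, Optional, Tuple

Reading = Tuple[float, bool]  # (value, is_suspect): hypothetical Level 0 layout


def to_level1(level0: List[Reading]) -> List[Optional[float]]:
    """Level 1: suspect points removed (kept as None so positions are preserved)."""
    return [None if suspect else value for value, suspect in level0]


def to_level2(level1: List[Optional[float]]) -> List[Optional[float]]:
    """Level 2: suspect points corrected where possible.

    Assumed rule: a gap is filled with the mean of its two neighbours,
    and only when both neighbours are good readings.
    """
    corrected = list(level1)
    for i in range(1, len(level1) - 1):
        before, after = level1[i - 1], level1[i + 1]
        if level1[i] is None and before is not None and after is not None:
            corrected[i] = (before + after) / 2
    return corrected


level0 = [(20.0, False), (99.0, True), (22.0, False), (50.0, True)]
level1 = to_level1(level0)
level2 = to_level2(level1)
print(level1)  # [20.0, None, 22.0, None]
print(level2)  # [20.0, 21.0, 22.0, None]
```

The last point stays empty at Level 2 because it has no following neighbour, which matches the "where possible" wording.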

Page 17

Data? IMOS netCDF Conversion

5 levels combining the levels of quality control and the levels of scientific interpretation

Raw data -> Knowledge products

Page 18

Data? DOI? CSIRO Astronomy

New version = new metadata page = new DOI

● Relation to other versions
● How different?
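
One way to picture "new version = new metadata page = new DOI" is a record that points back to its predecessor and says what changed. The sketch below is an assumption-laden illustration: the field names and the example DOIs are hypothetical, and it is not a description of CSIRO's system. The relation name `IsNewVersionOf` is taken from the DataCite metadata schema's relation types.

```python
def new_version_record(previous: dict, new_doi: str, change_note: str) -> dict:
    """Build a metadata record for a new version that links to the previous one."""
    return {
        "doi": new_doi,
        "title": previous["title"],
        "version": previous["version"] + 1,
        # Relation to other versions
        "related": [{"relation": "IsNewVersionOf", "doi": previous["doi"]}],
        # How different?
        "change_note": change_note,
    }


# Hypothetical records; the DOIs below are placeholders, not real identifiers.
v1 = {"doi": "10.5555/example.1", "title": "Example survey", "version": 1, "related": []}
v2 = new_version_record(v1, "10.5555/example.2", "Recalibrated 12 observations")
print(v2["related"])  # [{'relation': 'IsNewVersionOf', 'doi': '10.5555/example.1'}]
```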

Page 19

Versioning of Data Sets: IMOS/AODN Approach

Natalia Atkins, Sebastien Mancini, Roger Proctor

Page 20

What is IMOS Data?
– Most is dynamic
– New data is continuously added
– Existing data can be modified or updated

File type
• File based (e.g. NetCDFs)
• Databases (e.g. Animal tracking)
• Other (AUV images, acoustic recordings ...)

Data access
• Accessed via web services (WMS, WFS and WPS)
• THREDDS
• Direct data download (from our S3 storage)

Page 21

No formal data versioning.

We ask the following: the citation in a list of references is: "IMOS [year-of-data-download], [Title], [data-access-URL], accessed [date-of-access]."
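
The citation template above can be filled in programmatically. This is a minimal sketch: the function name, the example title and the example URL are hypothetical, the year of download is assumed to be the year of the access date, and ISO date formatting is an assumption since the slide does not prescribe a date format.

```python
from datetime import date


def imos_citation(title: str, data_access_url: str, accessed: date) -> str:
    """Fill in the IMOS citation template for a data download."""
    return (
        f"IMOS {accessed.year}, {title}, {data_access_url}, "
        f"accessed {accessed.isoformat()}."
    )


print(imos_citation("Example Dataset", "https://example.org/data", date(2016, 10, 12)))
# IMOS 2016, Example Dataset, https://example.org/data, accessed 2016-10-12.
```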

State of play in terms of DOI

• Set up to manually mint DOIs.

• We have been asked for DOIs for 2 AODN datasets (Australian Phytoplankton Database, and Glider Climatology product), and have archived 2 static versions.

Page 22

Current approaches and possibilities

Data stored on Amazon S3 (object storage)
• Since March 2016
• Use of versioning feature
• All previous versions kept except for satellite data

Web services (WMS, WFS, WPS using GeoServer)
• Currently not storing user queries
• Most of the RDA’s recommendations are achievable for small datasets

NetCDFs: history is captured within the file
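
The RDA recommendations referred to here are presumably those on citing dynamic data, which include storing each query with a timestamp and a checksum of its result set. A minimal Python sketch of such a query store follows. It is not something IMOS runs: the in-memory `store` dictionary, the identifier format and the function name are all assumptions made for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone


def record_query(store: dict, params: dict, result_rows: list) -> str:
    """Store a user query with a timestamp and result checksum; return its ID."""
    # Normalise the query so that parameter order does not matter.
    normalised = json.dumps(params, sort_keys=True)
    result_checksum = hashlib.sha256(
        json.dumps(result_rows, sort_keys=True).encode("utf-8")
    ).hexdigest()
    # Same query and same result give the same ID; changed data gives a new one.
    digest = hashlib.sha256((normalised + result_checksum).encode("utf-8")).hexdigest()
    query_id = f"query-{digest[:12]}"
    if query_id not in store:
        store[query_id] = {
            "params": normalised,
            "executed_at": datetime.now(timezone.utc).isoformat(),
            "result_checksum": result_checksum,
        }
    return query_id


store = {}
bbox_query = {"bbox": [110.0, -45.0, 155.0, -10.0], "start": "2015-01-01"}
first = record_query(store, bbox_query, [[1, 20.5], [2, 21.0]])
again = record_query(store, dict(reversed(list(bbox_query.items()))), [[1, 20.5], [2, 21.0]])
changed = record_query(store, bbox_query, [[1, 20.5], [2, 21.3]])
print(first == again, first == changed)  # True False
```

In a real service the returned ID would be replaced by a persistent identifier and the store by a database, which is where the effort grows for large or frequently queried collections.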

Page 23

Astronomy Virtual Observatory: data versioning

Yeshe Fenner

Astronomy Australia Ltd

Page 24

Data models and formats

Data types: images, spectra, image-spectral data cubes, raw visibility data (radio interferometry), catalogues

Data size: ~2 petabytes of optical, >12 petabytes of radio, 100s of terabytes of theory

Data formats/models: FITS, HDF5, PostgreSQL, Hadoop/Spark

Data access: 1) web UI, 2) third-party VO apps, 3) APIs

Data ingest: mostly dynamic, but only released publicly at discrete time points

Page 25

Data and pipeline versioning

General approach to versioning:

● Data Release: new DOIs for each public data release (manual validation & release process), e.g. _v01 for the first version, _v02 for the second

● Branches (if applicable): indicating different data processing pipelines to derive same quantity (therefore each property of the astronomical object has its own version)

● Old versions of data/branches remain available, but the current version is the default
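
The `_v01`, `_v02` release naming and the "current version is the default" rule can be sketched as below. The base name `survey` and both function names are hypothetical, and the sketch ignores branches.

```python
import re
from typing import List

RELEASE_SUFFIX = re.compile(r"_v(\d+)$")


def release_name(base: str, number: int) -> str:
    """Name a public data release, e.g. ('survey', 1) -> 'survey_v01'."""
    return f"{base}_v{number:02d}"


def current_release(names: List[str]) -> str:
    """Pick the default (highest-numbered) release; older ones stay available."""
    versioned = [name for name in names if RELEASE_SUFFIX.search(name)]
    if not versioned:
        raise ValueError("no versioned releases found")
    return max(versioned, key=lambda name: int(RELEASE_SUFFIX.search(name).group(1)))


releases = [release_name("survey", n) for n in (1, 2, 10)]
print(releases)                   # ['survey_v01', 'survey_v02', 'survey_v10']
print(current_release(releases))  # survey_v10
```

Comparing the numeric suffix rather than the raw string keeps the ordering correct if a collection ever passes 99 releases.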