High Performance Data Efficient Interoperability for Scientific Data Alex Ip 1 , Andrew Turner 1 , Dr. David Lescinsky 1 1 Geoscience Australia, Canberra, Australia

Oct 06, 2020


High Performance Data – Efficient Interoperability for Scientific Data

Alex Ip¹, Andrew Turner¹, Dr. David Lescinsky¹

¹Geoscience Australia, Canberra, Australia


Problem: Legacy Data Formats holding us back

• GA’s geoscientific data is currently held in several legacy formats, some of them proprietary

• Data is not readily interoperable – each different format requires custom development.

C3DIS 2018

High Performance Data – Efficient Interoperability for Scientific Data


Metadata – Tedious but Vital

• Metadata is of varying quality and accessibility – usually held separately from the data.

• Discovery is hit-or-miss. Metadata is sometimes deduced from the filename, context or actual data values.


Need Further Information!


Modern Container Formats (e.g. NetCDF/HDF)

• Transparency and interoperability of text-based file formats, combined with the efficiency of binary formats

• Encapsulated metadata ensures that data and metadata never become separated

• Internal indexing for rapid subsetting to support large-scale parallel processing

• Compression and chunking for efficient storage and retrieval

• Robust, well-proven software libraries for most languages and environments

• Able to be consumed via web services (e.g. OPeNDAP) using the same code as for direct file access
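The chunking, compression and internal indexing points above can be illustrated with a small standard-library sketch. It does not reproduce how HDF5 or NetCDF lay out data on disk; the chunk size, the `write_chunks` and `read_range` helper names and the use of `zlib` are assumptions chosen only to show why a chunk index lets a reader fetch a subset without decompressing the whole dataset.

```python
import struct
import zlib

CHUNK = 1000  # samples per chunk (arbitrary choice for this sketch)


def write_chunks(values, chunk=CHUNK):
    """Compress values chunk by chunk; the returned list acts as the chunk index."""
    chunks = []
    for start in range(0, len(values), chunk):
        block = values[start:start + chunk]
        raw = struct.pack(f"<{len(block)}i", *block)  # little-endian 32-bit ints
        chunks.append(zlib.compress(raw))
    return chunks


def read_range(chunks, start, stop, chunk=CHUNK):
    """Return values[start:stop] and the number of chunks decompressed.

    Assumes 0 <= start < stop <= total number of stored values.
    """
    out = []
    touched = 0
    for number in range(start // chunk, (stop - 1) // chunk + 1):
        raw = zlib.decompress(chunks[number])
        block = struct.unpack(f"<{len(raw) // 4}i", raw)
        base = number * chunk  # index of the first value held in this chunk
        lo = max(start - base, 0)
        hi = min(stop - base, len(block))
        out.extend(block[lo:hi])
        touched += 1
    return out, touched


values = list(range(10_000))
chunks = write_chunks(values)
subset, touched = read_range(chunks, 2500, 3200)
print(len(chunks), touched, subset[0], subset[-1])
```

Only the chunks that overlap the requested range are decompressed, which is the property that makes subsetting cheap on very large files.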


Container Format Options – Which one?

• Hierarchical Data Format v5 (HDF5) – Flexible, proven, domain-agnostic scientific data format

• Network Common Data Form v4 (NetCDF4) – facilitates the use of data in HDF5 by providing conventions to constrain structure and metadata.

• NetCDF Climate and Forecast conventions (NetCDF-CF) – provide enhanced interoperability by further constraining NetCDF4/HDF5, allowing us to leverage existing tools

[Figure: layered stack of the three formats, with an axis labelled INTEROPERABILITY]

NetCDF-CF: Highly Interoperable; Extensive Toolsets Available; Large User Base

NetCDF4: Simple; Standardised; Proven (at large scale)

HDF5: Self-Describing; Flexible & Extensible; Efficient & Scalable; Robust; Well Supported


What GA has been doing with its geophysics data…

• GA has been working to convert geophysics data (magnetics, radiometrics & gravity) to netCDF

• GA already delivers much of its geophysics data (mostly gridded) from the NCI NERDIP infrastructure

• NCI has helped GA achieve compliance with relevant data & metadata standards and conventions, e.g. netCDF, CF (Climate and Forecast), ACDD (Attribute Convention for Data Discovery), etc.

• GA now has systems to synchronise metadata between its catalogue and the data files

• GA has implemented Jupyter Notebook demonstrators for geophysics data
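As a rough illustration of what a metadata compliance check can look like, the sketch below tests a dictionary of global attributes for the four attributes that ACDD 1.3 lists as highly recommended. The `missing_attributes` helper and the sample attribute values are hypothetical, and this is not GA's or NCI's actual checking or synchronisation tooling.

```python
# Global attributes that ACDD 1.3 lists as "highly recommended".
HIGHLY_RECOMMENDED = ("title", "summary", "keywords", "Conventions")


def missing_attributes(global_attrs, required=HIGHLY_RECOMMENDED):
    """Return the required attribute names that are absent or blank."""
    return [
        name
        for name in required
        if not str(global_attrs.get(name, "")).strip()
    ]


# Invented attribute values for a hypothetical file.
attrs = {
    "title": "Hypothetical magnetic survey",
    "Conventions": "CF-1.6, ACDD-1.3",
    "summary": "",
}
missing = missing_attributes(attrs)
print(missing)
```

In practice the dictionary would be read from the file's global attributes by whichever netCDF library is in use.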


Gridded Data vs. Un-gridded (Point & Line)

• Gridded data is relatively straightforward in HDF/NetCDF (e.g. Earth observation, climate model output, etc.)

• Un-gridded point and flight-line (trajectory) geophysics data is new for netCDF. The approach is based on existing NOAA approaches to oceanographic & atmospheric data.

• Structure of point/line netCDF is very simple – each point-wise attribute is a separate variable. Able to be consumed via OPeNDAP web services

• Metadata requires careful attention. Naming conventions need to be standardised for possible inclusion in the CF convention


Representing Geoscientific Data in NetCDF4/HDF5

netcdf P452MAG {
dimensions:
    sample = 291635 ;
    line = 114 ;
    line_index = 115 ;
variables:
    int index_lines(line_index) ;
        index_lines:my_standard_name = "index_lines" ;
        index_lines:long_name = "zero based index of the first sample in the line, and then one past the end of the last line" ;
        index_lines:units = "1" ;
        index_lines:_FillValue = -2147483648 ;
        index_lines:_Storage = "contiguous" ;
        index_lines:_Endianness = "little" ;
    int LINE(line) ;
        LINE:my_standard_name = "line_number" ;
        LINE:long_name = "line identification number" ;
        LINE:units = "1" ;
        LINE:_FillValue = -2147483648 ;
        LINE:_Storage = "contiguous" ;
        LINE:_Endianness = "little" ;
    short bearing(line) ;
        bearing:my_standard_name = "aircraft_bearing" ;
        bearing:units = "degrees" ;
        bearing:_FillValue = -32767s ;
        bearing:_Storage = "contiguous" ;
        bearing:_Endianness = "little" ;
    int dateCode(line) ;
        dateCode:my_standard_name = "date" ;
        dateCode:long_name = "local date on which the line was flown" ;
        dateCode:units = "yyyymmdd" ;
        dateCode:_FillValue = -2147483648 ;
        dateCode:_Storage = "contiguous" ;
        dateCode:_Endianness = "little" ;
    short flight(line) ;
        flight:my_standard_name = "flight_number" ;
        flight:long_name = "flight identification number" ;
        flight:units = "1" ;
        flight:_FillValue = -32767s ;
        flight:_Storage = "contiguous" ;
        flight:_Endianness = "little" ;
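Going by the `long_name` of `index_lines` in the listing above, the samples of line i occupy the half-open index range from `index_lines[i]` to `index_lines[i + 1]`, which is why `line_index` is one larger than `line`. The sketch below shows that lookup, plus decoding of a `yyyymmdd` integer, using plain Python lists. All values are invented, `mag` is a hypothetical per-sample variable, and the two helper functions are illustrative rather than part of any GA library.

```python
from datetime import date

# Invented toy values standing in for the variables in the listing above.
index_lines = [0, 4, 7, 12]                # one more entry than there are lines
LINE = [1001, 1002, 1003]                  # line identification numbers
dateCode = [20170315, 20170315, 20170316]  # yyyymmdd integers, one per line
mag = [float(i) for i in range(12)]        # hypothetical per-sample variable


def samples_for_line(line_number):
    """Return the slice of sample indices that belong to one line number."""
    i = LINE.index(line_number)
    return slice(index_lines[i], index_lines[i + 1])


def decode_date(code):
    """Convert a yyyymmdd integer such as 20170315 into a datetime.date."""
    return date(code // 10000, code // 100 % 100, code % 100)


span = samples_for_line(1002)
print(span, mag[span])
print(decode_date(dateCode[LINE.index(1002)]))
```

The same slice can be applied to every per-sample variable, so one small index array is enough to recover the line grouping for the whole file.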


Making Geoscientific Data Available with Web Services

Using OPeNDAP:

• Data is served remotely in full or as a subset

• Data can be delivered in several formats including ASCII or binary
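A subset request of this kind can be sketched as a DAP2 URL, where a suffix such as `.ascii` or `.dods` selects the response format and a constraint of the form `variable[start:stride:stop]` selects the index range (the stop index is inclusive). The base URL below is a placeholder, not a real NCI endpoint, and the `opendap_url` helper is hypothetical.

```python
def opendap_url(base, subsets, response="ascii"):
    """Build a DAP2 request URL for one-dimensional variables.

    subsets maps a variable name to (first, last) inclusive zero-based
    indices; response is a DAP2 suffix such as "ascii" (text) or
    "dods" (binary).
    """
    projections = [
        f"{name}[{first}:1:{last}]" for name, (first, last) in subsets.items()
    ]
    return f"{base}.{response}?{','.join(projections)}"


# Placeholder dataset location; substitute a real OPeNDAP endpoint.
BASE = "https://example.org/thredds/dodsC/hypothetical/P452MAG.nc"

url = opendap_url(BASE, {"LINE": (0, 9), "bearing": (0, 9)})
print(url)
```

Client libraries normally build these requests behind the scenes, so the sketch is mainly useful for seeing what travels over the wire.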


Accessing Geoscientific Data in NetCDF4/HDF5

Data can now be rapidly accessed not only in-situ, but also via OPeNDAP from the NCI directly into environments including Python, R or MATLAB.


Case Study – Airborne Electromagnetic (AEM) Data

Subsurface conductivity section visualisation in Jupyter Notebook by Neil Symington

Originally tab-delimited text, now netCDF. Data reads now take less than a second, compared with minutes previously. Both raw AEM data and derived inversions are handled.

Case Study – Airborne Magnetic & Radiometric Survey Line Datasets

Trajectory (line) data encoded as netCDF, with line groupings and full metadata.

Fully-automated discovery of survey line datasets via CSW, data subset retrieval for area of interest via OPeNDAP web service.
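One way to picture the area-of-interest step is to turn a bounding box into runs of consecutive sample indices, each of which could then be requested as a single contiguous range. The sketch below does this with plain lists. The coordinates are invented and the `runs_in_bbox` helper is hypothetical; it is not the retrieval code used in the case study.

```python
def runs_in_bbox(lons, lats, bbox):
    """Return (first, last) inclusive index runs of consecutive samples in bbox.

    bbox is (min_lon, min_lat, max_lon, max_lat); lons and lats are
    per-sample coordinate lists of equal length.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    runs = []
    start = None  # index where the current run began, if inside one
    for i, (lon, lat) in enumerate(zip(lons, lats)):
        inside = min_lon <= lon <= max_lon and min_lat <= lat <= max_lat
        if inside and start is None:
            start = i
        elif not inside and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(lons) - 1))
    return runs


# Invented coordinates for a flight path that crosses the area twice.
lons = [148.0, 148.5, 149.0, 149.5, 150.0, 149.5, 149.0, 148.5]
lats = [-35.0] * 8
runs = runs_in_bbox(lons, lats, (148.9, -36.0, 149.6, -34.0))
print(runs)
```

Requesting a few contiguous runs keeps the transfer small while still using simple index ranges that a web service can serve efficiently.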


Case Study – Ground Gravity Survey Point Datasets

• Currently only available as CSV dumps from GA database

• 1600+ ground gravity survey point datasets processed into netCDF with full metadata

• Both raw data and analysis-ready adjusted values contained in the same file

• NetCDF permits encoding of point-based metadata as enumerated types (e.g. data quality string, instrument, etc.)

• No sexy visualisations (yet) – translation work was only completed last week.
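The enumerated-type idea mentioned in the list above amounts to storing a small integer code per point together with a lookup table of labels. The sketch below shows the principle with Python's `IntEnum`. The `Quality` categories are invented for illustration, since the actual vocabulary used in the gravity files is not given here.

```python
from enum import IntEnum


class Quality(IntEnum):
    """Hypothetical data-quality categories (not the real vocabulary)."""
    UNKNOWN = 0
    GOOD = 1
    SUSPECT = 2


def encode(labels):
    """Map label strings to compact integer codes, one per point."""
    return [int(Quality[label.upper()]) for label in labels]


def decode(codes):
    """Map stored integer codes back to readable label strings."""
    return [Quality(code).name.lower() for code in codes]


labels = ["good", "good", "suspect", "unknown"]
codes = encode(labels)
print(codes, decode(codes))
```

Storing codes instead of repeated strings keeps per-point metadata compact while remaining self-describing, because the lookup table travels with the data.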


FAIR Data Principles and GA’s netCDF

• Findable – Datasets are catalogued in GA’s ISO 19115 metadata catalogue. Metadata is harvested by the NCI, ANDS and data.gov.au, amongst others.

• Accessible – Data is published from the NCI via web services (or via HTTP download)

• Interoperable – netCDF is a self-describing, open format, with libraries implemented in many different programming environments.

• Reusable – Data has embedded metadata, with links to full metadata.


Summary

• NetCDF-CF is a highly suitable container format for multiple types of geoscientific data

• GA is in the process of translating several geoscientific data types as part of a pilot program with the NCI NERDIP program

• Proposed controlled vocabularies will be presented to communities of practice for approval, before being submitted for inclusion in the CF convention

• Layered, standards-based architecture will speed the development of future-proof systems

• Improving data interoperability will help break down silos and drive exciting new, transdisciplinary science.


Phone: +61 2 6249 9111

Web: www.ga.gov.au

Email: [email protected]

Address: Cnr Jerrabomberra Avenue and Hindmarsh Drive, Symonston ACT 2609

Postal Address: GPO Box 378, Canberra ACT 2601

Thank you!

Acknowledgements:

• Ross C Brodie, Yvette Poudjomdjomani, Neil Symington – Geoscience Australia

• Kelsey Druken, Lesley Wyborn – National Computational Infrastructure

• Edward King, Matt Paget – CSIRO