High Performance Data
Efficient Interoperability for Scientific Data
Alex Ip, Andrew Turner, Dr. David Lescinsky
Geoscience Australia, Canberra, Australia
Problem: Legacy Data Formats holding us back
• GA’s geoscientific data is currently held in several legacy
formats, some of them proprietary
• Data is not readily interoperable – each different format
requires custom development.
C3DIS 2018
High Performance Data – Efficient Interoperability for Scientific Data
Metadata – Tedious but Vital
• Metadata is of varying quality and accessibility – usually held
separately to data.
• Discovery is hit-or-miss. Metadata is sometimes deduced
from filename, context or actual data values
Need Further Information!
Modern Container Formats (e.g. NetCDF/HDF)
• Transparency and interoperability of text-based file formats,
combined with efficiency of binary formats
• Encapsulated metadata ensures that data and metadata
never become separated
• Internal indexing for rapid subsetting to support large-scale
parallel processing
• Compression and chunking for efficient storage and retrieval
• Robust, well-proven software libraries for most languages and
environments
• Able to be consumed via web services (e.g. OPeNDAP) using
the same code as for direct file access
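To make the chunking point concrete, the sketch below counts how many chunks a subset request has to read. It is plain Python with no netCDF library involved; the grid size, chunk size, and the helper name `chunks_touched` are illustrative assumptions, not GA's actual storage settings.

```python
from itertools import product
from math import ceil


def chunks_touched(start, stop, chunk_shape):
    """List the chunk indices overlapped by the half-open window [start, stop)."""
    per_dimension = [
        range(lo // size, (hi - 1) // size + 1)
        for lo, hi, size in zip(start, stop, chunk_shape)
    ]
    return list(product(*per_dimension))


# Hypothetical 4000 x 4000 grid stored as 500 x 500 chunks.
grid_shape = (4000, 4000)
chunk_shape = (500, 500)
total_chunks = 1
for cells, size in zip(grid_shape, chunk_shape):
    total_chunks *= ceil(cells / size)

# Request a 600 x 600 window starting at row 900, column 1200.
needed = chunks_touched((900, 1200), (1500, 1800), chunk_shape)
print(f"{len(needed)} of {total_chunks} chunks must be read")
```

Only the chunks that overlap the requested window are read and decompressed, which is why a small area of interest can be pulled from a large file quickly.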
Container Format Options – Which one?
• Hierarchical Data Format v5 (HDF5) – flexible, proven,
domain-agnostic scientific data format
• Network Common Data Form v4 (NetCDF4) – facilitates the use of
data in HDF5 by providing conventions to constrain structure
and metadata
• NetCDF Climate and Forecast conventions (NetCDF-CF) – provide
enhanced interoperability by further constraining NetCDF4/HDF5,
allowing us to leverage existing tools
Diagram (axis label: INTEROPERABILITY):
NetCDF-CF: Highly Interoperable, Extensive Toolsets Available, Large User Base
NetCDF4: Simple, Standardised, Proven (at large scale)
HDF5: Self-Describing, Flexible & Extensible, Efficient & Scalable, Robust, Well Supported
What GA has been doing with its geophysics data…
• GA has been working to convert geophysics data (magnetics,
radiometrics, & gravity) to netCDF
• GA already delivers much of its geophysics data (mostly
gridded) from the NCI NERDIP infrastructure
• NCI has helped GA achieve compliance with relevant data &
metadata standards and conventions, e.g. netCDF, CF
(Climate and Forecast), ACDD (Attribute Convention for Data
Discovery), etc.
• GA now has systems to synchronise metadata between its
catalogue and the data files
• GA has implemented Jupyter Notebook demonstrators for
geophysics data
Gridded Data vs. Un-gridded (Point & Line)
• Gridded data is relatively straightforward in HDF/NetCDF (e.g.
Earth observation, climate model output, etc.)
• Un-gridded point and flight-line (trajectory) geophysics data is
new for netCDF. GA's approach follows existing NOAA practice
for oceanographic & atmospheric data.
• Structure of point/line netCDF is very simple – each point-wise
attribute is a separate variable. The files can be consumed via
OPeNDAP web services
• Metadata requires careful attention. Need to standardise
naming conventions for possible inclusion in CF convention
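The "one variable per point-wise attribute" layout can be pictured as a pivot from row records to columns that share a single sample dimension. The sketch below is plain Python; the attribute names, the values, and the helper `rows_to_variables` are hypothetical and do not come from a GA dataset.

```python
# Hypothetical point records, as they might appear in a row-oriented text export.
rows = [
    {"longitude": 149.10, "latitude": -35.30, "mag": 58012.4},
    {"longitude": 149.11, "latitude": -35.30, "mag": 58013.1},
    {"longitude": 149.12, "latitude": -35.30, "mag": 58011.8},
]


def rows_to_variables(records):
    """Pivot row records into one list per attribute, all the same length."""
    names = list(records[0])
    return {name: [record[name] for record in records] for name in names}


variables = rows_to_variables(rows)
sample_count = len(rows)

# Reading one attribute touches only that variable, not the whole record.
print(variables["mag"])
```

Because every variable is indexed by the same sample position, element `i` of each list describes the same point, and a client can request a single attribute without transferring the others.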
Representing Geoscientific Data in NetCDF4/HDF5
netcdf P452MAG {
dimensions:
    sample = 291635 ;
    line = 114 ;
    line_index = 115 ;
variables:
    int index_lines(line_index) ;
        index_lines:my_standard_name = "index_lines" ;
        index_lines:long_name = "zero based index of the first sample in the line, and then one past the end of the last line" ;
        index_lines:units = "1" ;
        index_lines:_FillValue = -2147483648 ;
        index_lines:_Storage = "contiguous" ;
        index_lines:_Endianness = "little" ;
    int LINE(line) ;
        LINE:my_standard_name = "line_number" ;
        LINE:long_name = "line identification number" ;
        LINE:units = "1" ;
        LINE:_FillValue = -2147483648 ;
        LINE:_Storage = "contiguous" ;
        LINE:_Endianness = "little" ;
    short bearing(line) ;
        bearing:my_standard_name = "aircraft_bearing" ;
        bearing:units = "degrees" ;
        bearing:_FillValue = -32767s ;
        bearing:_Storage = "contiguous" ;
        bearing:_Endianness = "little" ;
    int dateCode(line) ;
        dateCode:my_standard_name = "date" ;
        dateCode:long_name = "local date on which the line was flown" ;
        dateCode:units = "yyyymmdd" ;
        dateCode:_FillValue = -2147483648 ;
        dateCode:_Storage = "contiguous" ;
        dateCode:_Endianness = "little" ;
    short flight(line) ;
        flight:my_standard_name = "flight_number" ;
        flight:long_name = "flight identification number" ;
        flight:units = "1" ;
        flight:_FillValue = -32767s ;
        flight:_Storage = "contiguous" ;
        flight:_Endianness = "little" ;
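The `index_lines` variable has one more entry than there are lines: entry `i` is the first sample of line `i`, and the final entry is one past the last sample. A minimal sketch of how that could be used to pull the samples of one line is given below. It uses a made-up three-line survey rather than the real file, and the sample variable `mag` and the helper `samples_for_line` are assumptions, since the listing above does not show the per-sample variables.

```python
# Hypothetical miniature survey: 3 lines and 10 samples.
LINE = [1001, 1002, 1003]       # line identification numbers, one per line
index_lines = [0, 4, 7, 10]     # start of each line, then one past the last sample
mag = [10.0, 11.0, 12.0, 13.0, 20.0, 21.0, 22.0, 30.0, 31.0, 32.0]


def samples_for_line(line_number, values):
    """Return the slice of a per-sample variable that belongs to one line."""
    i = LINE.index(line_number)
    start, stop = index_lines[i], index_lines[i + 1]
    return values[start:stop]


print(samples_for_line(1002, mag))
```

Because each line maps to one contiguous index range, the same start and stop values could be used to request just that range from a remote file.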
Making Geoscientific Data Available with Web Services
Using OPeNDAP:
• Data is served
remotely in full
or as subset
• Data can be
delivered in
several formats
including ASCII
or binary
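A subset request of this kind is usually expressed in the URL itself. The sketch below builds a DAP2-style request with an index range in `[start:stride:stop]` form (stop inclusive) and a response suffix of `.ascii` or `.dods` (binary). The host and path are placeholders rather than a real NCI endpoint, and the helper `opendap_subset_url` is hypothetical.

```python
def opendap_subset_url(dataset_url, variable, start, stop, stride=1, response="ascii"):
    """Build a DAP2 request for variable[start:stride:stop]; stop is inclusive."""
    if response not in ("ascii", "dods"):
        raise ValueError("response must be 'ascii' or 'dods'")
    if start < 0 or stop < start or stride < 1:
        raise ValueError("need 0 <= start <= stop and stride >= 1")
    return f"{dataset_url}.{response}?{variable}[{start}:{stride}:{stop}]"


# Placeholder dataset URL; substitute the address published by the data provider.
dataset = "https://example.org/thredds/dodsC/survey/P452MAG.nc"

ascii_url = opendap_subset_url(dataset, "LINE", 0, 9)
binary_url = opendap_subset_url(dataset, "LINE", 0, 9, response="dods")
print(ascii_url)
print(binary_url)
```

Client libraries normally build these requests for the user, so the URL form matters mainly when testing a service by hand in a browser or with a command-line HTTP tool.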
Accessing Geoscientific Data in NetCDF4/HDF5
Data can now be rapidly accessed not only in-situ, but also via
OPeNDAP from the NCI directly into environments including
Python, R or MATLAB
Case Study – Airborne Electromagnetic (AEM) Data
Subsurface conductivity section visualisation in Jupyter
Notebook by Neil Symington
Originally tab-delimited text, now netCDF. Data reads now take under a
second, compared with minutes previously. Both raw AEM data and derived
inversions are handled.
Case Study – Airborne Magnetic & Radiometric Survey Line Datasets
Trajectory (line) data encoded as netCDF, with line groupings
and full metadata.
Fully-automated discovery of survey line datasets via CSW, data
subset retrieval for area of interest via OPeNDAP web service.
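One way an area-of-interest subset could be prepared is to find the sample indices that fall inside a bounding box and collapse them into contiguous index ranges, each of which maps onto a single range request. The sketch below is an assumption about how a client might do this, not a description of GA's implementation; the coordinates and both helper names are made up.

```python
def indices_in_bbox(lons, lats, bbox):
    """Return indices of points inside bbox = (west, south, east, north)."""
    west, south, east, north = bbox
    return [
        i
        for i, (lon, lat) in enumerate(zip(lons, lats))
        if west <= lon <= east and south <= lat <= north
    ]


def to_ranges(indices):
    """Collapse sorted indices into inclusive (first, last) runs."""
    ranges = []
    for i in indices:
        if ranges and i == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], i)
        else:
            ranges.append((i, i))
    return ranges


# Hypothetical flight path: eastbound along one line, then westbound on the next.
lons = [148.0, 148.5, 149.0, 149.5, 150.0, 149.5, 149.0, 148.5]
lats = [-35.0, -35.0, -35.0, -35.0, -35.0, -35.5, -35.5, -35.5]

inside = indices_in_bbox(lons, lats, (148.9, -36.0, 149.6, -34.0))
print(to_ranges(inside))
```

Each resulting pair is an inclusive first and last sample index, so a survey that crosses the area twice produces two short requests instead of one download of the whole dataset.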
Case Study – Ground Gravity Survey Point Datasets
• Currently only available as CSV dumps from GA database
• 1600+ ground gravity survey point datasets processed into
netCDF with full metadata
• Both raw data and analysis-ready adjusted values contained
in the same file
• NetCDF permits encoding of point-based metadata as
enumerated types (e.g. data quality string, instrument, etc.)
• No sexy visualisations (yet) – translation work was only
completed last week.
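The idea behind enumerated types is that a repeated string is stored once in a lookup table, while each point carries only a small integer code. The sketch below shows that encoding in plain Python without a netCDF library; the instrument names and the helpers `encode_enum` and `decode_enum` are hypothetical.

```python
def encode_enum(values):
    """Return (lookup, codes), assigning integer codes in order of first appearance."""
    lookup = {}
    codes = []
    for value in values:
        codes.append(lookup.setdefault(value, len(lookup)))
    return lookup, codes


def decode_enum(lookup, codes):
    """Map integer codes back to their original strings."""
    names = {code: name for name, code in lookup.items()}
    return [names[code] for code in codes]


# Hypothetical per-point instrument metadata for five gravity stations.
instrument = ["gravimeter_A", "gravimeter_B", "gravimeter_A", "gravimeter_A", "gravimeter_B"]

lookup, codes = encode_enum(instrument)
print(lookup)
print(codes)
```

With thousands of points per survey, storing a one-byte or two-byte code per point in place of a repeated string keeps the per-point metadata compact while remaining fully recoverable.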
FAIR Data Principles and GA’s netCDF
• Findable – Datasets are catalogued in GA’s ISO 19115
metadata catalogue. Metadata is harvested by the NCI, ANDS
and data.gov.au, amongst others.
• Accessible – Data is published from the NCI via web
services (or via HTTP download)
• Interoperable – netCDF is a self-describing, open format, with
libraries implemented in many different programming
environments.
• Reusable – Data has embedded metadata, with links to full
metadata.
Summary
• NetCDF-CF is a highly suitable container format for multiple types
of geoscientific data
• GA is translating several geoscientific data types as part of a
pilot with the NCI NERDIP program
• Proposed controlled vocabularies will be presented to
communities of practice for approval, before being submitted for
inclusion in the CF convention
• Layered, standards-based architecture will speed the development
of future-proof systems
• Improving data interoperability will help break down silos and drive
exciting new, transdisciplinary science.
Phone: +61 2 6249 9111
Web: www.ga.gov.au
Email: [email protected]
Address: Cnr Jerrabomberra Avenue and Hindmarsh Drive, Symonston ACT 2609
Postal Address: GPO Box 378, Canberra ACT 2601
Thank you!
Acknowledgements:
• Ross C Brodie, Yvette Poudjomdjomani, Neil Symington – Geoscience
Australia
• Kelsey Druken, Lesley Wyborn – National Computational Infrastructure
• Edward King, Matt Paget – CSIRO