File Formats, Conventions, and Data Level Interoperability ESDSWG New Orleans, Oct 20, 2010 Joe Glassy, Chris Lynnes ESDSWG Tech Infusion

Dec 14, 2015


Preston Sarratt
Page 1:

File Formats, Conventions, and Data Level Interoperability

ESDSWG New Orleans, Oct 20, 2010
Joe Glassy, Chris Lynnes
ESDSWG Tech Infusion

Page 2:

Introduction & overview

• Outline of objectives:
– Discuss the role of standard, self-describing “file formats” in data level interoperability
– Summarize common file formats in use, their properties, and benefits (“data life cycle economics”)
– Discuss criteria for choosing a file format and matching it to the needs of consumers and producers
– Discuss the critical role of conventions: any file format needs good recipes to make it interoperable!
– Examples: NASA MEaSUREs F/T, SMAP, AIRS, Aura

Page 3:

Role(s) Of File Formats in Interoperability

• File formats represent versatile “packages” for multi-dimensional science data and metadata.

• Offer self-describing “well-known structures” to codify desired, common conventions and practices.

• Offer well-documented reference cases to encapsulate specific data models.

• Standard file formats dock with format-aware tools to offer users a seamless end-to-end experience and platform portability

• Enhance Mission-to-Mission continuity

Page 4:

…investment life-cycle economics…

Page 5:

Why (and how) are file formats important?

• Standard formats
– Come with thorough documentation
– Provide good reference implementations

• Common formats
– More datasets in a format → more tools that read that format

• Canonical structures and names
– → general purpose handlers for coordinates, etc. → smarter tools

Page 6:

A generic work flow…

• Consider user community needs and culture, fit within architecture, institutional policies & preferences

• Choose a standard file format (or sub-variant)

• Design a convention-enabled, specific internal layout with metadata interfaces

• Prototype: implement in prototype, evaluate

• Implement in production context

• Integrate within discovery and catalog environments (catalog interoperability…)

Page 7:

Examples of standard file formats

• HDF5 – a file format on its own, as well as a broad foundation for others

• netCDF v4 (stable at v4.1.1, newest: v4.1.2-beta1)
– v4 Classic (widespread adoption, some limitations…)
– v4 Enhanced (supports groups, user-defined and variable-length types, and more)

• netCDF v3 Classic (legacy+, tools+, but limited)

• HDFEOS2, HDFEOS5 – EOS Terra, Aqua, Aura…

• HDF4 – legacy, extensive use by MODIS Terra, Aqua

• Many other domain-specific, less generic formats abound… (need transform tools to/from HDF?)
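
The formats listed on this slide can usually be told apart by the signature bytes at the start of a file, which is a quick first check when it is unclear which family a granule belongs to. The sketch below is illustrative only: the function names are hypothetical, and it only inspects offset 0, so an HDF5 file with a leading user block would be reported as unknown.

```python
# Leading signature bytes for the formats named on this slide.
# netCDF-4 files (Classic model or Enhanced) and HDF-EOS5 files are stored
# as HDF5, so they share the HDF5 signature and look identical at this level.
SIGNATURES = [
    (b"\x89HDF\r\n\x1a\n", "HDF5 (also netCDF-4 and HDF-EOS5)"),
    (b"\x0e\x03\x13\x01", "HDF4 (also HDF-EOS2)"),
    (b"CDF\x01", "netCDF-3 classic"),
    (b"CDF\x02", "netCDF-3 64-bit offset"),
]


def sniff_format(header: bytes) -> str:
    """Return a best-guess format label from the first bytes of a file."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 8 bytes of `path` and classify them."""
    with open(path, "rb") as fh:
        return sniff_format(fh.read(8))
```

Telling netCDF-4 or HDF-EOS5 apart from plain HDF5 needs a look at the internal layout and attributes, which is where conventions come in.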

Page 8:

Some selection criteria…

• Do the file format’s capabilities support the required functionality?

• What is the breadth of acceptance and adoption within the larger community? (And/or does institutional policy dictate a specific format?)

• Presence and quality of documentation (reference, examples, and especially tutorials), API software, and community support?

• Contribution to investment, data life-cycle economics?

• What is the level of standardization?

• Adaptability of format to widely used conventions like CF 1.x, or other accepted convention(s)?

Page 9:

Internal Layout / Design (once format is chosen & adopted…)

• Define & refine high-level organization/structure
– /DATA
– /METADATA

• Distinguish ‘data’ from ‘metadata’, core structure vs. ‘attributes’
– Dimensions, coordinate variables, projection attributes
– Missing_data, _FillValue vs. internal fill value
– Units, gain, offset, min, max, range, etc.

• Prototype it!
– Leverage script environments (Python h5py, PyTables, etc.)
– Panoply and HDFView are also quick and useful for prototyping and feedback

Page 10:

Using “Groups”

• HDF5 (and NetCDF v4-Enhanced) support full use of groups e.g. /DATA vs. /METADATA, etc.

• Groups useful in partitioning out functionally related sets of data or attributes; Hierarchical view mimics file-system

• Facilitates appropriate information-hiding: highlights needed info and shields the rest (principle of least privilege…)

• Well supported by modern tools (Panoply, HDFView, PyTables, h5py) and low-level APIs.
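
As a rough illustration of the /DATA vs. /METADATA split, the following sketch builds a tiny in-memory HDF5 file with h5py (a third-party package that must be installed). The variable name FT_SSMI is borrowed from the Freeze/Thaw example later in the deck; the array values, attribute contents, and file name are made up for illustration and do not describe the real product.

```python
import h5py

# In-memory HDF5 file: with backing_store=False nothing is written to disk,
# so the file name is only a label.
with h5py.File("layout_sketch.h5", "w", driver="core", backing_store=False) as f:
    data = f.create_group("DATA")
    meta = f.create_group("METADATA")

    # Hypothetical 2x3 grid of unsigned 8-bit class codes; 255 marks fill.
    ft = data.create_dataset("FT_SSMI", data=[[0, 1, 2], [3, 255, 0]], dtype="u1")
    ft.attrs["long_name"] = "freeze/thaw status (illustrative)"
    ft.attrs["_FillValue"] = 255

    # Group-level attributes keep descriptive metadata out of the data group.
    meta.attrs["title"] = "layout sketch"

    # Objects are addressed with file-system-like paths.
    top_level = sorted(f.keys())
    shape = f["/DATA/FT_SSMI"].shape
    fill = int(f["/DATA/FT_SSMI"].attrs["_FillValue"])
```

A prototype like this can be opened in HDFView or Panoply (after writing it to disk) to get early feedback on the layout.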

Page 11:

Example(s) of File Formats In Action

• HDF5 – NASA MEaSUREs
– NASA MEaSUREs Freeze/Thaw (soon available at NSIDC)
– http://measures.ntsg.umt.edu/sample_2007_day180.zip

• Aqua AIRS Level 2 (from earlier talk):
– http://airspar1u.ecs.nasa.gov/opendap/Aqua_AIRS_Level2/AIRX2RET.005/2010/285/AIRS.2010.10.12.090.L2.RetStd.v5.2.2.0.G10286064818.hdf

• Aura TES (TES-Aura_L3-CH4_r0000002135_F01_05.he5)

Page 12:

Example: NASA MEaSUREs Freeze/Thaw, Daily in HDF5
Metadata Block: Attributes

Page 13:

Example: NASA MEaSUREs Daily Freeze/Thaw in HDF5
Data Variable (FT_SSMI) and Attributes

Page 14:

Example: NASA Level 2 AIRS (Swath) in HDF4

Page 15:

Example: NetCDF, (tos) Sea surface temperatures collected by PCMDI for use by the IPCC, illustrating CF v1.0 layout

Page 16:

Example: TES (HDFEOS5) illustrating CF v1.0 layout

Page 17:

CF Conventions & file formats: how they contribute to interoperability

• CF v1.4.x -- the term “CF” is now broader than just climate-forecasting!

• Standard Name Table -- a step towards wider adoption of names, controlled vocabularies, units terminology

• CF v1.4.x provides tool-makers with helpful “lingua-franca” guidance.

• Within a file-format, adopting conventions like CF promotes common layout, names, semantics, for dataset-to-dataset compatibility -- a key to wider data level interoperability.

Page 18:

Attributes vs. Metadata?
one man’s ceiling is another man’s floor…

• Collection level vs. data set vs. granule level

• Structural vs. science-content

• Swath vs. grid vs. point

• Commonly used attributes:
– Conventions attribute, communicates which convention was used
– Basic globals: title, history, institution, source, references
– Coordinate variables, axis, formula_terms
– Units, _FillValue, missing_data, valid_range
– Short_name, long_name, other provenance
– (gain, offset / scale_factor, add_offset), etc.
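
To make the role of the packing attributes concrete, here is a minimal sketch of how a reader might apply scale_factor, add_offset, and _FillValue under the CF rule (unpacked = packed × scale_factor + add_offset). The attribute values and the function name are hypothetical. Gain/offset pairs defined by other conventions may follow a different formula, so the product documentation should be checked before reusing this.

```python
import math


def unpack(packed, scale_factor=1.0, add_offset=0.0, fill_value=None):
    """Apply the CF packing rule: unpacked = packed * scale_factor + add_offset.

    Values equal to fill_value are returned as NaN instead of being scaled.
    """
    out = []
    for v in packed:
        if fill_value is not None and v == fill_value:
            out.append(math.nan)
        else:
            out.append(v * scale_factor + add_offset)
    return out


# Hypothetical variable attributes, named as in the CF conventions.
attrs = {"units": "K", "scale_factor": 0.01, "add_offset": 273.15, "_FillValue": -32768}

values = unpack(
    [0, 100, -32768],
    attrs["scale_factor"],
    attrs["add_offset"],
    attrs["_FillValue"],
)
```

Because the rule is carried in the attributes, a generic tool can unpack any conforming variable without product-specific code.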

Page 19:

Challenges? (just a few remain…)

• Evolution, bifurcation, and asymmetric support can result in occasional user confusion:
– HDF5 v1.8.x vs. v1.6.x families?
– NetCDF v4 Enhanced vs. NetCDF v4 Classic vs. v3?
– HDFEOS5 vs. HDFEOS2?

• Both GUI tool and API support tends to vary by platform (Linux, Mac, Win7) and sub-flavor…

• Multi-library dependency stacks beg for fully bundled, version-matched end-to-end install pkg!

• Conventions community (CF v1.4.x) and metadata standards communities also in motion (but that’s good too…)

Page 20:

Resources: URLs

• Climate and Forecast (CF) Conventions (now at 1.4.x):
– http://cf-pcmdi.llnl.gov/
– http://cf-pcmdi.llnl.gov/documents/cf-conventions

• HDF:
– http://www.hdfgroup.org/HDF5/doc/index.html

• HDFEOS:
– http://www.hdfgroup.org/hdfeos.html
– http://hdfeos.org/software/aug_hdfeos5.php

• NetCDF:
– http://www.unidata.ucar.edu/software/netcdf/
– http://www.unidata.ucar.edu/software/netcdf/docs/BestPractices.html

• General:
– http://www.oceanteacher.org/OTMediawiki/index.php/Self-Describing_Formats
– http://en.wikipedia.org/wiki/List_of_file_formats

Page 21:

Resources: File format related Tools

• Panoply: http://www.giss.nasa.gov/tools/panoply/

• HDFView: http://www.hdfgroup.org/hdf-java-html/hdfview/

• OpenDAP: http://opendap.org

• IDV: http://www.unidata.ucar.edu/software/idv/

• McIDAS: http://www.unidata.ucar.edu/software/mcidas/

• Python:
– h5py: http://code.google.com/p/h5py/, http://h5py.alfven.org/
– PyTables: http://www.pytables.org/moin

• Perl: PDL-IO-HDF5, and Biohdf?

• Many others: HEG, MTD, HDFEOS plug-in for HDFview, HDFLook, (ncdump, h5dump, and cousins), GRADS, Matlab, binary APIs

Page 22:

A provisional DOI, UUID Strategy

• What we used for NASA MEaSUREs Freeze/Thaw, daily (v2), just delivered:
– DOI: assigned to our reference paper by IEEE Transactions on Geoscience and Remote Sensing
– UUID recipe:
seedString = "www.our.url/GranuleName/Datetime8601Stamp"
import uuid
granule_uuid = uuid.uuid5(uuid.NAMESPACE_URL, seedString)  # uuid5 requires a namespace; NAMESPACE_URL is assumed here
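
The appeal of a name-based (version 5) UUID is that the same seed string always yields the same identifier, so a granule's UUID can be regenerated rather than stored in a lookup table. The sketch below shows one way the recipe could be wrapped. The function name, the timestamp format, and the choice of uuid.NAMESPACE_URL are assumptions, not details given in the slides.

```python
import uuid
from datetime import datetime, timezone


def granule_uuid(base_url: str, granule_name: str, stamp: datetime) -> uuid.UUID:
    """Derive a repeatable UUID from URL + granule name + ISO 8601 timestamp."""
    # Basic-format ISO 8601 stamp; the trailing Z assumes `stamp` is in UTC.
    seed = f"{base_url}/{granule_name}/{stamp.strftime('%Y%m%dT%H%M%SZ')}"
    # uuid5 hashes (namespace, name) with SHA-1; NAMESPACE_URL is an assumed choice.
    return uuid.uuid5(uuid.NAMESPACE_URL, seed)


stamp = datetime(2007, 6, 29, tzinfo=timezone.utc)  # hypothetical granule date
a = granule_uuid("www.our.url", "GranuleName", stamp)
b = granule_uuid("www.our.url", "GranuleName", stamp)
```

Here `a` and `b` are equal because the inputs are identical, while changing any part of the seed produces a different UUID.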