File Formats, Conventions, and Data Level Interoperability ESDSWG New Orleans, Oct 20, 2010 Joe Glassy, Chris Lynnes ESDSWG Tech Infusion
Dec 14, 2015
Introduction & overview
• Outline of objectives:
– Discuss the role of standard, self-describing “file formats” in data level interoperability
– Summarize common file formats in use, their properties, and benefits (“data life cycle economics”)
– Discuss criteria for choosing a file format and matching it to the needs of consumers and producers
– Discuss the critical role of conventions: any file format needs good recipes to make it interoperable!
– Examples: NASA Measures F/T, SMAP, AIRS, Aura
Role(s) Of File Formats in Interoperability
• File formats represent versatile “packages” for multi-dimensional science data and metadata.
• Offer self-describing “well-known structures” to codify desired, common conventions and practices.
• Offer well-documented reference cases to encapsulate specific data models.
• Standard file formats dock with format-aware tools to offer users a seamless end-to-end experience and platform portability
• Enhance Mission-to-Mission continuity
Why (and how) are file formats important?
• Standard formats
– Come with thorough documentation
– Provide good reference implementations
• Common formats
– More datasets in a format → more tools that read that format
• Canonical structures and names → general purpose handlers for coordinates, etc. → smarter tools
A generic work flow…
• Consider user community needs and culture, fit within architecture, institutional policies & preferences
• Choose a standard file format (or sub-variant)
• Design a convention-enabled, specific internal layout with metadata interfaces
• Prototype: implement in prototype, evaluate
• Implement in production context
• Integrate within discovery and catalog environments (catalog interoperability…)
Examples of standard file formats
• HDF5 – a file format on its own, as well as a broad foundation for others
• netCDF v4 (stable at v4.1.1, newest: v4.1.2-beta1)
– v4 Classic (widespread adoption, some limitations…)
– v4 Enhanced (supports groups, user-defined types, variable-length types, and more)
• netCDF v3 Classic (legacy+, tools+, but limited)
• HDFEOS2, HDFEOS5 – EOS Terra, Aqua, Aura…
• HDF4 – legacy, extensive use by MODIS Terra, Aqua
• Many other domain-specific, less generic formats abound… (need transform tools to/from HDF?)
Some selection criteria…
• Do the file format’s capabilities support required functionality?
• What is breadth of acceptance, adoption within larger community? (and/or, does institutional policy dictate a specific format?)
• Presence and quality of documentation (reference, examples and especially tutorials), API software, and community support?
• Contribution to investment, data life-cycle economics?
• What is the level of standardization?
• Adaptability of format to widely used conventions like CF 1.x, or other accepted convention(s)?
Internal Layout / Design (once format is chosen & adopted…)
• Define & refine high-level organization / structure
– /DATA
– /METADATA
• Distinguish ‘data’ from ‘metadata’, core structure vs. ‘attributes’
– Dimensions, coordinate variables, projection attributes
– missing_data, _FillValue vs. internal fill value
– Units, gain, offset, min, max, range, etc.
• Prototype it!
– Leverage script environments (Python h5py, PyTables, etc.)
– Panoply and HDFView are also quick and useful for prototyping and feedback
Using “Groups”
• HDF5 (and NetCDF v4-Enhanced) support full use of groups e.g. /DATA vs. /METADATA, etc.
• Groups useful in partitioning out functionally related sets of data or attributes; Hierarchical view mimics file-system
• Facilitates appropriate information hiding: highlights needed info, shields the rest (principle of least privilege…)
• Well supported by modern tools (Panoply, HDFView, PyTables, h5py) and low-level APIs.
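A minimal prototyping sketch of the /DATA vs. /METADATA group layout, assuming h5py and NumPy are installed. The dataset name, attribute values, and the helper build_prototype are hypothetical, chosen only for illustration; the file is held in memory so nothing is written to disk.

```python
import h5py
import numpy as np


def build_prototype(name="prototype_layout.h5"):
    """Create an in-memory HDF5 file with /DATA and /METADATA groups."""
    # driver="core" with backing_store=False keeps the file in memory only.
    f = h5py.File(name, "w", driver="core", backing_store=False)

    # File-level (global) attributes; the values are placeholders.
    f.attrs["Conventions"] = "CF-1.4"
    f.attrs["title"] = "Hypothetical daily grid (illustration only)"

    # Partition the file into functionally related groups.
    data = f.create_group("DATA")
    meta = f.create_group("METADATA")

    # A tiny 2 x 3 grid stored as int16, with -9999 marking missing cells.
    values = np.array([[10, 20, -9999], [30, 40, 50]], dtype="int16")
    dset = data.create_dataset("surface_state", data=values)
    dset.attrs["units"] = "1"
    dset.attrs["long_name"] = "hypothetical surface state flag"
    dset.attrs["_FillValue"] = np.int16(-9999)

    meta.attrs["history"] = "created by a prototype script"
    return f


f = build_prototype()
names = []
f.visit(names.append)  # collect every group and dataset path in the file
print(sorted(names))
f.close()
```

Opening the same layout in HDFView or Panoply (after writing it to disk instead of memory) gives quick visual feedback on whether the grouping reads well to a user.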
Example(s) of File Formats In Action
• HDF5 – NASA Measures
– NASA Measures Freeze/Thaw (soon available at NSIDC)
– http://measures.ntsg.umt.edu/sample_2007_day180.zip
• Aqua AIRS Level 2 (from earlier talk):
– http://airspar1u.ecs.nasa.gov/opendap/Aqua_AIRS_Level2/AIRX2RET.005/2010/285/AIRS.2010.10.12.090.L2.RetStd.v5.2.2.0.G10286064818.hdf
• Aura TES (TES-Aura_L3-CH4_r0000002135_F01_05.he5)
Example: NetCDF, (tos) Sea surface temperatures collected by PCMDI for use by the IPCC, illustrating CF v1.0 layout
CF Conventions & file formats: how they contribute to interoperability
• CF v1.4.x -- the term “CF” is now broader than just climate-forecasting!
• Standard Name Table -- a step towards wider adoption of names, controlled vocabularies, units terminology
• CF v1.4.x provides tool-makers with helpful “lingua-franca” guidance.
• Within a file-format, adopting conventions like CF promotes common layout, names, semantics, for dataset-to-dataset compatibility -- a key to wider data level interoperability.
Attributes vs. Metadata? One man’s ceiling is another man’s floor…
• Collection level vs. data set vs. granule level
• Structural vs. science-content
• Swath vs. grid vs. point
• Commonly used attributes:
– Conventions attribute: communicates which convention was used
– Basic globals: title, history, institution, source, references
– Coordinate variables, axis, formula_terms
– units, _FillValue, missing_data, valid_range
– short_name, long_name, other provenance
– (gain, offset / scale_factor, add_offset), etc.
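To make the packing attributes concrete, here is a small sketch of how a reader might turn stored integers into physical values using scale_factor, add_offset, and _FillValue (the CF rule is unpacked = packed * scale_factor + add_offset). The attribute values and the helper name unpack are assumptions for illustration, not taken from any of the datasets mentioned here.

```python
def unpack(raw_values, attrs):
    """Convert packed integers to physical values; fill values become None."""
    scale = attrs.get("scale_factor", 1.0)
    offset = attrs.get("add_offset", 0.0)
    fill = attrs.get("_FillValue")
    unpacked = []
    for raw in raw_values:
        if fill is not None and raw == fill:
            unpacked.append(None)  # missing cell, no physical value
        else:
            unpacked.append(raw * scale + offset)
    return unpacked


# Hypothetical variable attributes, as a tool might read them from a file.
attrs = {
    "units": "K",
    "long_name": "hypothetical temperature",
    "scale_factor": 0.5,
    "add_offset": 250.0,
    "_FillValue": -9999,
}

print(unpack([0, 20, -9999, 101], attrs))
```

Because the rule lives in agreed attribute names, a generic tool can apply it to any conforming dataset without dataset-specific code.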
Challenges? (just a few remain…)
• Evolution, bifurcation, and asymmetric support can result in occasional user confusion:
– HDF v1.8.x vs. v1.6.x families?
– NetCDF v4 Enhanced vs. NetCDF v4 Classic vs. v3?
– HDFEOS5 vs. HDFEOS2?
• Both GUI tool and API support tend to vary by platform (Linux, Mac, Win7) and sub-flavor…
• Multi-library dependency stacks beg for a fully bundled, version-matched, end-to-end install package!
• Conventions community (CF v1.4.x) and metadata standards communities also in motion (but that’s good too…)
Resources: URLs
• Climate Forecast (CF) Conventions (now at 1.4.x):
– http://cf-pcmdi.llnl.gov/
– http://cf-pcmdi.llnl.gov/documents/cf-conventions
• HDF:
– http://www.hdfgroup.org/HDF5/doc/index.html
• HDFEOS:
– http://www.hdfgroup.org/hdfeos.html
– http://hdfeos.org/software/aug_hdfeos5.php
• NetCDF:
– http://www.unidata.ucar.edu/software/netcdf/
– http://www.unidata.ucar.edu/software/netcdf/docs/BestPractices.html
• General:
– http://www.oceanteacher.org/OTMediawiki/index.php/Self-Describing_Formats
– http://en.wikipedia.org/wiki/List_of_file_formats
Resources: File format related Tools
• Panoply: http://www.giss.nasa.gov/tools/panoply/
• HDFView: http://www.hdfgroup.org/hdf-java-html/hdfview/
• OpenDAP: http://opendap.org
• IDV: http://www.unidata.ucar.edu/software/idv/
• McIDAS: http://www.unidata.ucar.edu/software/mcidas/
• Python:
– h5py: http://code.google.com/p/h5py/, http://h5py.alfven.org/
– PyTables: http://www.pytables.org/moin
• Perl: PDL-IO-HDF5, and Biohdf?
• Many others: HEG, MTD, HDFEOS plug-in for HDFView, HDFLook, (ncdump, h5dump, and cousins), GrADS, MATLAB, binary APIs
A provisional DOI, UUID Strategy
• What we used for NASA Measures Freeze/Thaw, daily (v2), just delivered:
– DOI: assigned to our reference paper by IEEE Transactions on Geoscience and Remote Sensing
– UUID recipe:
import uuid
seedString = "www.our.url/GranuleName/Datetime8601Stamp"
# uuid5 requires a namespace argument; NAMESPACE_URL is an assumption here
granuleUUID = uuid.uuid5(uuid.NAMESPACE_URL, seedString)
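The recipe can be sketched as a small runnable function. Python's uuid.uuid5 takes a namespace plus a name, so a namespace has to be chosen; uuid.NAMESPACE_URL is assumed here because the seed is URL-like. The base URL, granule name, timestamp, and the helper name granule_uuid are all hypothetical.

```python
import uuid


def granule_uuid(base_url, granule_name, timestamp):
    """Build a deterministic, name-based (version 5) UUID for a granule."""
    seed = f"{base_url}/{granule_name}/{timestamp}"
    # NAMESPACE_URL is an assumed choice; any fixed namespace would work,
    # as long as producers and consumers agree on it.
    return uuid.uuid5(uuid.NAMESPACE_URL, seed)


# Hypothetical inputs, for illustration only.
a = granule_uuid("http://www.example.org", "FT_2007_day180", "2007-06-29T00:00:00Z")
b = granule_uuid("http://www.example.org", "FT_2007_day180", "2007-06-29T00:00:00Z")
c = granule_uuid("http://www.example.org", "FT_2007_day181", "2007-06-30T00:00:00Z")

print(a == b)  # same seed, same UUID
print(a == c)  # different granule, different UUID
```

The design point is reproducibility: anyone who knows the seed recipe can regenerate a granule's UUID without consulting a registry.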