Implementing a Data Quality Strategy to simplify access to data
Kelsey Druken, Claire Trenham, Lesley Wyborn, Ben Evans
National Computational Infrastructure, Canberra
eResearch 2016
11/9/2016
nci.org.au
National Computational Infrastructure
• The diverse data collections are co-located with a Petascale HPC and Cloud facility that includes:
  - Top 50 Supercomputer (1.2 Pflops)
  - HPC Cloud (3000 node)
  - Digital Laboratories
• Dynamic subsets are actively encouraged, and can be accessed via data services
• Processing times have decreased dramatically: new large data sets can be generated or analysed in minutes or hours instead of months
National Computational Infrastructure
• NCI hosts one of Australia’s largest repositories (10+ PBytes) of research data collections
• Spanning datasets from climate, coasts, oceans and geophysics through to astronomy, bioinformatics and the social sciences
National Computational Infrastructure
1. Climate/ESS Model Assets and Data Products
2. Earth and Marine Observations and Data Products
Digital Elevation, Bathymetry
Onshore/Offshore Geophysics
1 Pbytes
Seasonal Climate 700 Tbytes
Bureau of Meteorology Observations 350 Tbytes
Bureau of Meteorology Ocean-Marine 350 Tbytes
Terrestrial Ecosystem 290 Tbytes
Reanalysis products 100 Tbytes
Key Challenges
• Application of community-agreed data standards to the broad set of Earth systems and environmental data that are being used
• Within these disciplines, data span a wide range of:
  - Gridded
  - Non-gridded (i.e., trajectories/profiles, point data)
  - Coordinate reference projections
  - Resolutions
How data collections are accessed
Collections are being accessed and utilised from a broad range of options:
• Direct access on filesystem
• Web and data services
• Data portals
• Virtual labs (e.g., virtual desktops)
eReefs online analysis portal
National Environmental Research Data Interoperability Platform (NERDIP)
[Figure: NERDIP layer diagram. Labels recovered from the diagram: VGL; AGDC VL; Climate; Services Layer; Fast "whole-of-library" catalogue; API Layers; GDAL; NetCDF-4; ISO 19115, ACDD, RIF-CS, DCAT, etc.; Data Conventions; netCDF-CF; [HDF4-EOS]; HP Data Library Layer; HDF5; [SEG-Y]; [Airborne Geophysics]; [FITS]; [LAS LiDAR]; Lustre; Other Storage (e.g., HDFS)]
Provide seamless programmatic access through standardisation of both data and services
Data Quality Strategy (DQS)
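As a rough illustration of what checking data against the conventions named on this slide (netCDF-CF, ACDD) could involve, the sketch below tests a dictionary of file-level global attributes for a few commonly expected entries. It is not NCI's DQS tooling: the function name, the example values, and the choice of attributes (the four that ACDD 1.3 lists as highly recommended) are assumptions made for illustration only, and a real check would read the attributes from the file with a netCDF library.

```python
# Hypothetical sketch, not the NCI implementation.
# Attribute names follow the ACDD 1.3 "highly recommended" global attributes.
REQUIRED_GLOBAL_ATTRS = ("title", "summary", "keywords", "Conventions")


def check_global_attributes(attrs):
    """Return a list of human-readable problems found in a file's global attributes."""
    problems = []
    for name in REQUIRED_GLOBAL_ATTRS:
        value = attrs.get(name)
        # Treat absent and blank attributes the same way.
        if value is None or not str(value).strip():
            problems.append(f"missing or empty global attribute: {name}")
    # CF-compliant files declare a CF version in the Conventions attribute.
    conventions = str(attrs.get("Conventions", ""))
    if "CF-" not in conventions:
        problems.append("Conventions does not declare a CF version")
    return problems


# Made-up attributes standing in for what would be read from a real file.
example = {
    "title": "Hypothetical gridded product",
    "summary": "",
    "Conventions": "CF-1.6, ACDD-1.3",
}
report = check_global_attributes(example)
for line in report:
    print(line)
```

Running the same check across a whole collection would give a simple per-file report of where metadata falls short of the agreed standards.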
The Goal
• Combining data
• Visualising
• How can we enable this type of easy access and use?
Motivation: Data Management Maturity Program
DMM Capability – 25 Processes to Perform, Manage, Define
1. Data Management Strategy Process Area
   1. Data Management Strategy
   2. Communications
   3. Data Management Function
   4. Grant Strategy/Business Case
   5. Funding
2. Data Governance Process Area
   6. Governance Management
   7. Vocabulary/Glossary
   8. Metadata Management
3. Data Quality Process Area
   9. Data Quality Strategy
   10. Data Profiling
   11. Data Quality Assessment
   12. Data Cleansing and Curation
4. Data Operations Process Area
   13. Data Requirements Definition
   14. Data Lifecycle Management
   15. Contribution / Provider Management
5. Platform and Architecture Process Area
   16. Architectural Standards
   17. Architectural Approach
   18. Data Management Platform
   19. Data Integration / Data Linking
   20. Data Archiving and Preservation
6. Infrastructure Support Practices
   21. Measurement and Analysis
   22. Process Management
   23. Process Quality Assurance
   24. Risk Management
   25. Configuration Management
Please see the eResearch presentation on this work: