The HDF Group www.hdfgroup.org January 8, 2016 2016 ESIP Winter Meeting Data Container Study: HDF5 in a POSIX File System or HDF5 C3: Compression, Chunking, Clusters Aleksandar Jelenak, John Readey, H. Joe Lee, Ted Habermann

Jan 18, 2018

Transcript
Page 1: The HDF Group

www.hdfgroup.org
January 8, 2016
2016 ESIP Winter Meeting

Data Container Study: HDF5 in a POSIX File System
or
HDF5 C3: Compression, Chunking, Clusters

Aleksandar Jelenak, John Readey, H. Joe Lee, Ted Habermann

Page 2: Hardware

• Using Open Science Data Cloud Griffin cluster
• Xeon systems with 1-16 cores
• 60 compute nodes
• 10Gb Ethernet
• Ephemeral local POSIX file system
• Shared persistent storage (Ceph object store, S3 API)

Page 3: Software

• HDF5 library v1.8.15
• Compression libraries: MAFISC/GZIP/BLOSC
• Operating system: Ubuntu Linux
• Linux development tools
• Any HDF5-supported C compiler
• HDF5 tools: h5dump, h5repack, etc.
• Python 3
• Python packages: h5py, NumPy, ipyparallel, PyTables

Page 4: Data

• NCEP/DOE Reanalysis II, for GSSTF, Daily Grid, v3
  • 0.25×0.25 deg, global
  • Time span 1987-2008
  • 7,850 daily files, 120 GB

• NOAA Coral Reef Temperature Anomaly Database (CoRTAD) version 5
  • 0.04165×0.04165 deg (~4 km), global
  • Time range 1982-2012, weekly time step
  • 8 files, 253 GB

Page 5: Workflow

Data Ingest/Preprocessing

• Download data as HDF5 files from archive and transfer to S3 object store
• Repack original file(s) using HDF5 chunking and compression, transfer to S3 store
• Collate data from original files into one file with HDF5 chunking & compression, transfer to S3 store
• Index data in file(s) by collecting descriptive statistics (min, max, etc.) for each HDF5 chunk

Data Analysis

• Launch a number of VMs and connect them into an ipyparallel cluster
• Distribute input HDF5 data from S3 store to cluster VMs
• Execute data analysis task on cluster VMs
• Collect data analysis results from cluster VMs and prepare the report
• Shut down the cluster and VMs
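The distribute, execute, and collect steps of the analysis phase follow a common map-style pattern. The sketch below illustrates that pattern only. It swaps in Python's standard `concurrent.futures` thread pool for the ipyparallel cluster used in the study, and the `analyze` function and the in-memory `inputs` are hypothetical stand-ins for the real analysis task and the HDF5 files.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze(task):
    """Hypothetical analysis task: mean of the values in one input 'file'."""
    name, values = task
    return name, sum(values) / len(values)


# Hypothetical inputs standing in for HDF5 files fetched from the S3 store.
inputs = {
    "file_a": [1.0, 2.0, 3.0],
    "file_b": [10.0, 20.0],
    "file_c": [5.0],
}

# Entering the block starts the workers; leaving it shuts them down.
with ThreadPoolExecutor(max_workers=3) as pool:
    # Distribute the inputs, run the task on each, and collect the results.
    results = dict(pool.map(analyze, inputs.items()))

print(results)
```

With a real cluster, the pool would be replaced by the cluster's own map call and the inputs by file paths local to each VM.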

Page 6: System Architecture

Page 7: HDF5 Chunks

• Chunking is one of the storage layouts for HDF5 datasets
• An HDF5 dataset's byte stream is broken up into chunks, which are stored at various locations in the file
• Chunks are of equal size in the dataset's dataspace but may not be of equal byte size in the file
• HDF5 filtering works on chunks only
  • Filters for compression/decompression, scaling, checksum calculation, etc.
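The point that equally shaped chunks can occupy different numbers of bytes in the file can be seen with a small experiment. This is a sketch, not HDF5 itself: it applies Python's `zlib` (DEFLATE, the algorithm behind HDF5's GZIP filter) to two byte buffers of the same length. The float32 element type, the 72×144 chunk extent, and the sample values are assumptions chosen for illustration.

```python
import struct
import zlib

CHUNK_ELEMS = 72 * 144  # elements per chunk (illustrative chunk extent)


def pack_chunk(values):
    """Serialize float32 values to bytes, as a chunk looks before filtering."""
    return struct.pack(f"<{len(values)}f", *values)


# Two chunks with the same number of elements but different content.
uniform_chunk = pack_chunk([273.15] * CHUNK_ELEMS)
varied_chunk = pack_chunk(
    [273.15 + (i * 37 % 1009) * 0.01 for i in range(CHUNK_ELEMS)]
)

raw_sizes = (len(uniform_chunk), len(varied_chunk))
stored_sizes = tuple(
    len(zlib.compress(chunk, 9)) for chunk in (uniform_chunk, varied_chunk)
)

print("bytes before filter:", raw_sizes)
print("bytes after DEFLATE level 9:", stored_sizes)
```

Both buffers start at the same size, yet the uniform one compresses to far fewer bytes, which is why chunk offsets and sizes in the file have to be tracked per chunk.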

Page 8: Findings: Chunking

• Two different chunking algorithms:
  • Unidata's optimal chunking formula for 3D datasets
  • h5py formula

• Three different chunk sizes chosen for the collated NCEP data set:
  • Synoptic map: 1×72×144
  • Data rod: 7850×1×1
  • Data cube: 25×20×20
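To make the trade-off between the three chunk shapes concrete, the sketch below counts how many chunks each shape produces for the collated NCEP dataset (7850×720×1440, the shape reported for the input file) and how many chunks a read must touch for two access patterns. The two access patterns and the assumption that selections start at the dataset origin are illustrative choices, not part of the study.

```python
from math import ceil, prod

DATASET_SHAPE = (7850, 720, 1440)  # time, latitude, longitude

CHUNK_SHAPES = {
    "synoptic map": (1, 72, 144),
    "data rod": (7850, 1, 1),
    "data cube": (25, 20, 20),
}


def chunks_touched(selection, chunk):
    """Chunks overlapped by a selection of the given extent.

    Assumes the selection starts at the dataset origin, so it is aligned
    with the chunk grid.
    """
    return prod(ceil(s / c) for s, c in zip(selection, chunk))


full_map = (1, 720, 1440)   # one time step, whole globe
time_series = (7850, 1, 1)  # all time steps at one grid point

summary = {}
for name, chunk in CHUNK_SHAPES.items():
    summary[name] = (
        chunks_touched(DATASET_SHAPE, chunk),  # chunks in the whole dataset
        chunks_touched(full_map, chunk),       # chunks read for one map
        chunks_touched(time_series, chunk),    # chunks read for one series
    )
    print(name, summary[name])
```

The counts show why no single shape suits every query: the map-shaped chunks need 7,850 chunk reads for one time series, the rod-shaped chunks need over a million for one global map, and the cube sits between the two.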

Page 9: Findings: Chunking

• Input was collated NCEP data file:
  • 7850×720×1440, 5 datasets, 121 gigabytes

• Outputs:

Chunk Size    Filter          File Size Change (%)    Runtime (hours)
1×72×144      GZIP level 9    -63.6                   9.5
7850×1×1      GZIP level 9    -62.1                   10
25×20×20      GZIP level 9    -64.5                   6

Page 10: Findings: Compression

• Compression filters: GZIP, SZIP, MAFISC, Blosc
• NCEP data set: 7,850 files
• Chunk size: 45×180

Filter           Total File Size Change (%)    Runtime (hours)
GZIP, level 9    -63.2                         7.33
SZIP             -70.3                         22
MAFISC           -86.4                         22
Blosc            -61.5                         4.67
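The size changes reported for the four filters can also be read as compression ratios (original size divided by compressed size). The short calculation below does that conversion using only the percentages and runtimes reported for this comparison. The function name `compression_ratio` is introduced here for illustration.

```python
def compression_ratio(size_change_percent):
    """Original size divided by new size, given the size change in percent."""
    remaining_fraction = 1.0 + size_change_percent / 100.0
    return 1.0 / remaining_fraction


# (total file size change in percent, runtime in hours) as reported
REPORTED = {
    "GZIP, level 9": (-63.2, 7.33),
    "SZIP": (-70.3, 22),
    "MAFISC": (-86.4, 22),
    "Blosc": (-61.5, 4.67),
}

ratios = {name: compression_ratio(change) for name, (change, _) in REPORTED.items()}

for name, (change, hours) in REPORTED.items():
    print(f"{name:14s} ratio {ratios[name]:.2f}:1  runtime {hours} h")
```

Read this way, MAFISC shrinks the data by roughly a factor of 7 against a factor of about 2.6 to 2.7 for Blosc and GZIP, at about three to five times their runtime.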

Page 11: Data Indexing

• Value range information (min, max) captured for each HDF5 dataset chunk
• These values, plus each chunk's coordinates in the dataset's dataspace, are stored in a PyTables file
• ~30 minutes to collect index data from the collated NCEP data file
• Work on incorporating this information in processing is ongoing
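One way such a per-chunk index could be used is to skip chunks whose value range cannot satisfy a query. The sketch below builds a (chunk origin, min, max) index for a small 2-D array and then selects candidate chunks for a value range. It uses plain Python lists rather than HDF5 and PyTables, and the function names and sample data are hypothetical.

```python
from itertools import product


def build_chunk_index(data, chunk_shape):
    """Return (chunk_origin, min, max) for each chunk of a 2-D list of lists."""
    n_rows, n_cols = len(data), len(data[0])
    c_rows, c_cols = chunk_shape
    index = []
    for r0, c0 in product(range(0, n_rows, c_rows), range(0, n_cols, c_cols)):
        values = [
            data[r][c]
            for r in range(r0, min(r0 + c_rows, n_rows))
            for c in range(c0, min(c0 + c_cols, n_cols))
        ]
        index.append(((r0, c0), min(values), max(values)))
    return index


def candidate_chunks(index, lo, hi):
    """Origins of chunks whose [min, max] range overlaps the query [lo, hi]."""
    return [origin for origin, vmin, vmax in index if vmax >= lo and vmin <= hi]


# A 4x4 array split into 2x2 chunks.
data = [
    [1, 2, 10, 11],
    [3, 4, 12, 13],
    [5, 6, 20, 21],
    [7, 8, 22, 23],
]
index = build_chunk_index(data, (2, 2))
hits = candidate_chunks(index, 15, 30)
print(hits)
```

Only the chunks returned by `candidate_chunks` would need to be read and decompressed. The others are ruled out from the index alone.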

Page 12: Findings: Parallel

• Load time improved up to 16 nodes
• Run time improved super-linearly with more nodes (up to 64)

Page 13: Conclusion

• Using a computing environment where the POSIX file system is not persistent storage poses unique challenges
• Chunk size does influence runtime
• Compression filter performance: Blosc < GZIP9 < MAFISC
• Increasing the number of compute nodes reduces the observed differences in runtime