April 28, 2008 LCI Tutorial 1
HDF5 Tutorial
LCI
April 28, 2008
Outline
• Why HDF5?
• Introduction to HDF5 data and programming models
• HDF5 tools and utilities
• HDF5 advanced topics
• Introduction to parallel HDF5
• HDF5 features that affect performance (or caching and buffering in HDF5)
Why HDF5?
Answering big questions …
Matter & the universe
Weather and climate
Life and nature
[Figure: Total Column Ozone (Dobson), August 24, 2001 and August 24, 2002]
… involves big data …
… varied data …
Thanks to Mark Miller, LLNL
… and complex relationships …
[Figure labels: Contig Summaries, Discrepancies, Contig Qualities, Coverage Depth, Read quality, Aligned bases, Contig, Reads, Percent match, Trace, SNP Score]
… on big computers …
… and on little computers …
How do we…
• Describe our data?
• Read it? Store it? Find it? Share it? Mine it?
• Move it into, out of, and between computers and repositories?
• Achieve storage and I/O efficiency?
• Give applications and tools easy access to our data?
HDF started right here at NCSA
HDF solution
• Common data models
• Standard APIs
• Scientific data file format
• I/O software & tools
• Efficient storage, I/O
The HDF5 Format
An HDF5 file is a container…
…into which you can put your data objects.
[Figure: a container holding a table and two palettes]
lat | lon | temp
----|-----|-----
 12 |  23 |  3.1
 15 |  24 |  4.2
 17 |  21 |  3.6
HDF5 structures for organizing objects
[Figure: root group “/” and group “/foo” organizing a palette, raster images, a 3-D array, a 2-D array, and a table]
Table:
lat | lon | temp
----|-----|-----
 12 |  23 |  3.1
 15 |  24 |  4.2
 17 |  21 |  3.6
Introduction to HDF5 Data and Programming Models
Tutorial
Part I
Mesh Example, in HDFView
HDF5 Data Model
HDF5 data model
• HDF5 file – container for scientific data
• Primary Objects
  • Groups
  • Datasets
• Additional ways to organize data
  • Attributes
  • Sharable objects
  • Storage and access properties
Everything else is built from these parts.
HDF5 Dataset
• Data
• Metadata
  • Dataspace: Rank = 3; Dimensions: Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
  • Datatype: IEEE 32-bit float
  • Attributes: Time = 32.4, Pressure = 987, Temp = 56
  • Storage info: Chunked, Compressed
Dataspaces
• Two roles
  • Dataspace contains spatial info about a dataset stored in a file
    • Rank and dimensions
    • Permanent part of dataset definition
  • Dataspace describes application’s data buffer and data elements participating in I/O
[Figure: two dataspaces, one with Rank = 2 and Dimensions = 4x6, one with Rank = 1 and Dimensions = 12]
Datatypes (array elements)
• Datatype – how to interpret a data element
  • Permanent part of the dataset definition
• Two classes: atomic and compound
Datatypes
• HDF5 atomic types
  • normal integer & float
  • user-definable (e.g. 13-bit integer)
  • variable length types (e.g. strings)
  • pointers - references to objects/dataset regions
  • enumeration - names mapped to integers
  • array
• HDF5 compound types
  • Comparable to C structs
  • Members can be atomic or compound types
HDF5 dataset: array of records
Dimensionality: 5 x 3
Datatype: Record with members int8, int4, int16, and a 2x3x2 array of float32
Attributes
• Attribute – data of the form “name = value”, attached to an object
• Operations are scaled-down versions of dataset operations
  • Not extendible
  • No compression
  • No partial I/O
• Optional for the dataset definition
• Can be overwritten, deleted, added during the “life” of a dataset
• Size under 64K in releases before HDF5 1.8.0
Groups
• A mechanism for collections of related objects
• Every file starts with a root group
• Similar to UNIX directories
• Can have attributes
[Figure: root group “/” with members A, B, and C, and objects k, l, and m below them]
Path to HDF5 object in a file
/ (root)
/x
/foo
/foo/temp
/foo/bar/temp
[Figure: root group “/” containing x and foo; foo contains temp and bar; bar contains temp]
Shared objects
/A/P
/B/R
/C/P
[Figure: root group “/” with groups A, B, and C, which reach objects P and R]
Special Storage Options
• chunked: Better subsetting access time; extendable
• compressed: Improves storage efficiency, transmission speed
• extendable: Arrays can be extended in any direction
• split file: Metadata in one file, raw data in another
[Figure: Dataset “Fred” with metadata for Fred in File A and data for Fred in File B]
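To see why chunking helps subsetting, it can help to work out which chunk an element falls into. The following is a conceptual sketch in plain C, not the library's internal layout code; the names `chunk_pos_t`, `locate_in_chunk`, and `chunks_needed` are hypothetical.

```c
#include <stddef.h>

/* Where an element of a chunked 2-D array lives. */
typedef struct {
    size_t chunk_row, chunk_col;  /* which chunk holds the element   */
    size_t in_row, in_col;        /* position inside that chunk      */
} chunk_pos_t;

/* Map element (row, col) to its chunk and to its offset within the chunk,
 * for chunks of chunk_rows x chunk_cols elements. */
chunk_pos_t locate_in_chunk(size_t row, size_t col,
                            size_t chunk_rows, size_t chunk_cols)
{
    chunk_pos_t p;
    p.chunk_row = row / chunk_rows;
    p.chunk_col = col / chunk_cols;
    p.in_row    = row % chunk_rows;
    p.in_col    = col % chunk_cols;
    return p;
}

/* Number of chunks needed to cover a dimension of length n
 * (the last chunk may be only partly filled). */
size_t chunks_needed(size_t n, size_t chunk_len)
{
    return (n + chunk_len - 1) / chunk_len;
}
```

Reading a small sub-block then only involves the chunks that overlap it, and each chunk can be compressed on its own.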
HDF5 Software
HDF5 software stack
[Figure: Tools & Applications on top of the HDF I/O Library, which sits on top of the HDF File]
Structure of HDF5 Library
Object API (C, Fortran 90, Java, C++)
• Specify objects and transformation properties
• Invoke data movement operations and data transformations
Library internals
• Performs data transformations and other prep for I/O
• Configurable transformations (compression, etc.)
Virtual file I/O (C only)
• Perform byte-stream I/O operations (open/close, read/write, seek)
• User-implementable I/O (stdio, network, memory, etc.)
Writing – move from memory to disk
[Figure: data moving from memory to disk]
Partial I/O
Move just part of a dataset
(a) Hyperslab from a 2D array to the corner of a smaller 2D array
(b) Regular series of blocks from a 2D array to a contiguous sequence at a certain offset in a 1D array
(c) A sequence of points from a 2D array to a sequence of points in a 3D array.
(d) Union of hyperslabs in file to union of hyperslabs in memory.
[Figures: each case shows the selection in memory and on disk]
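In the C API a hyperslab is described per dimension by a start, stride, count, and block (the arguments of `H5Sselect_hyperslab`). The following plain-C sketch imitates that selection on an in-memory 2D array to show which elements are picked; it does not call the library, and the names `slab_dim_t` and `copy_hyperslab_2d` are hypothetical.

```c
#include <stddef.h>

/* Hyperslab description for one dimension. */
typedef struct {
    size_t start;   /* index of the first selected element     */
    size_t stride;  /* distance between the starts of blocks   */
    size_t count;   /* number of blocks                        */
    size_t block;   /* number of consecutive elements per block */
} slab_dim_t;

/* Copy the selected elements of a row-major 2-D array `src` (with src_cols
 * columns) into the contiguous buffer `dst`, in row-major order.
 * Returns the number of elements copied. */
size_t copy_hyperslab_2d(const int *src, size_t src_cols,
                         slab_dim_t rows, slab_dim_t cols, int *dst)
{
    size_t n = 0;
    for (size_t ri = 0; ri < rows.count; ri++) {
        for (size_t rb = 0; rb < rows.block; rb++) {
            size_t r = rows.start + ri * rows.stride + rb;
            for (size_t ci = 0; ci < cols.count; ci++) {
                for (size_t cb = 0; cb < cols.block; cb++) {
                    size_t c = cols.start + ci * cols.stride + cb;
                    dst[n++] = src[r * src_cols + c];
                }
            }
        }
    }
    return n;
}
```

With a 4x6 source, rows {start 1, stride 1, count 2, block 1} and columns {start 2, stride 1, count 3, block 1} select a 2x3 sub-array, as in case (a); a stride larger than the block gives the regular series of blocks in case (b).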
Layers – parallel example
I/O flows through many layers from application to disk.
• Application
• I/O library (HDF5)
• Parallel I/O library (MPI-I/O)
• Parallel file system (GPFS)
• Switch network/I/O servers
• Disk architecture & layout of data on disk
[Figure: parallel computing system (Linux cluster) with multiple compute nodes]
Virtual I/O layer
[Figure: library layers: Object API (C, Fortran 90, Java, C++), Library internals, Virtual file I/O (C only)]
Virtual file I/O layer
• A public API for writing I/O drivers
• Allows HDF5 to interface to disk, the network, memory, or a user-defined device
Virtual file I/O drivers: File, File Family, MPI I/O, Memory, Network, Stdio
[Figure: each driver maps to its own kind of “Storage”]
[Figure: layered overview, from applications down to storage]
Apps: simulation, visualization, remote sensing…
Examples: Thermonuclear simulations, Product modeling, Data mining tools, Visualization tools, Climate models
Common application-specific data models and APIs: UDM, SAF, hdf5mesh, HDF-EOS, IDL, appl-specific
Sources: LANL; LLNL, SNL; Grids; COTS; NASA
HDF5 data model & API
HDF5 serial & parallel I/O
HDF5 virtual file layer (I/O drivers): MPI I/O, Split Files, Stdio, Custom, Stream
HDF5 format
Storage: File; Split metadata and raw data files; File on parallel file system; User-defined device; Across the network or to/from another application or library
Other info
• Runs almost anywhere
  • Most workstations
  • Big ASC machines, Cray, Compaq
  • TeraGrid and other clusters
• QA
  • Daily regression tests on key platforms
  • Meets NASA’s highest technology readiness level
Other HDF Software
• THG
  • HDF Java tools
  • Command-line utilities
  • Regression and performance testing software
• Commercial (IDL, Matlab, HDF Explorer, etc.)
• Community (EOS, ASCI, etc.)
• Integration with other software (SRB, etc.)
Creating an HDF5 file with HDF5 tools
HDFView, h5mkgrp, h5import
Example: create this HDF5 file
[Figure: root group “/” containing dataset A, a 4x6 array of floats, and group B]
Example: create this HDF5 file
• HDFView
• h5mkgrp file.h5 /B
• h5import A.txt -c A.conf -o file.h5
Introduction to HDF5 Programming model
and APIs
Programming model for sequential access
Goals of HDF5 Library
• Flexible API to support a wide range of operations on data
• High performance access in serial and parallel computing environments
• Compatibility with common data models and programming languages
Because of these goals,
the HDF5 API is rich and large
Operations supported by the API
• Create groups, datasets, attributes, linkages
• Create complex data types
• Assign storage and I/O properties to objects
• Complex subsetting during read/write
• Flexible I/O (parallel, remote, etc.)
• Ability to transform data during I/O
• Query about file structure and properties
• Query about object structure, content, properties
Characteristics of the HDF5 API
• For flexibility, the API is extensive – 300+ functions
• This can be daunting, at first
• But there is hope
  • You can do a lot with just a few functions
  • So start simple, and build up your knowledge
• The library functions are categorized by object type
  • Once you learn the system, it’s much less daunting
• And there is an “H5Lite” API if all you want to do are simple things.
The General HDF5 API
• Currently has C, Fortran 90, Java and C++ bindings.
• C routines begin with prefix H5*, where * is a single letter indicating the object on which the operation is to be performed.
• Full functionality
Example APIs:
  H5D : Dataset interface, e.g., H5Dread
  H5F : File interface, e.g., H5Fopen
  H5S : dataSpace interface, e.g., H5Sclose
How to Compile HDF5 Applications
• h5cc – HDF5 C compiler command
  • Similar to mpicc
• h5fc – HDF5 F90 compiler command
  • Similar to mpif90
• h5c++ – HDF5 C++ compiler command
• To compile:
  % h5cc h5prog.c
  % h5fc h5prog.f90
h5cc/h5fc/h5c++ -show option
• -show displays the compiler commands and options without executing them, i.e., dry run
% h5cc -show Sample_c.c
gcc -I/home/packages/hdf5_1.6.6/Linux_2.6/include -UH5_DEBUG_API -DNDEBUG -I/home/packages/szip/static/encoder/Linux2.6-gcc/include -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -D_POSIX_SOURCE -D_BSD_SOURCE -std=c99 -Wno-long-long -O -fomit-frame-pointer -finline-functions -c Sample_c.c
gcc -std=c99 -Wno-long-long -O -fomit-frame-pointer -finline-functions -L/home/packages/szip/static/encoder/Linux2.6-gcc/lib Sample_c.o -L/home/packages/hdf5_1.6.6/Linux_2.6/lib /home/packages/hdf5_1.6.6/Linux_2.6/lib/libhdf5_hl.a /home/packages/hdf5_1.6.6/Linux_2.6/lib/libhdf5.a -lsz -lz -lm -Wl,-rpath -Wl,/home/packages/hdf5_1.6.6/Linux_2.6/lib
The General Programming Paradigm
• Properties (called creation and access property lists) of objects are defined (optional)
• Objects are opened or created
• Objects then accessed
• Objects finally closed
Order of Operations
• The library imposes an order on the operations by argument dependencies
  Example: A file must be opened before a dataset because the dataset open call requires a file handle as an argument
• Objects can be closed in any order, and reusing a closed object will result in an error
HDF5 C Programming Issues
For portability, HDF5 library has its own defined types:
  hid_t: object identifiers (native integer)
  hsize_t: size used for dimensions (unsigned long or unsigned long long)
  hssize_t: for specifying coordinates and sometimes for dimensions (signed long or signed long long)
  herr_t: function return value
  hvl_t: variable length datatype
For C, include hdf5.h at the top of your HDF5 application.
Example: Create this HDF file
[Figure: root group “/” containing dataset A, a 4x6 array of floats, and group B]
Steps:
Create an HDF5 file
[Figure: a file containing only the root group “/”]
Steps to Create a File
• Decide any special properties the file should have
  • Creation properties, like size of user block
  • Access properties, such as metadata cache size
• Create property lists, if necessary
• Create the file
• Close the file and the property lists, as needed
Create new file with default properties
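hid_t  file_id;
herr_t status;

/* Create file.h5, truncating it if it already exists, with default
   creation and access property lists. */
file_id = H5Fcreate("file.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

/* Terminate access to the file. */
status = H5Fclose(file_id);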
Add a dataset
[Figure: root group “/” containing dataset A, a 4x6 array of floats]
Dataset components
• Data
• Metadata
  • Dataspace: Rank = 3; Dimensions: Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
  • Datatype: IEEE 32-bit float
  • Attributes: Time = 32.4, Pressure = 987, Temp = 56
  • Storage info: Chunked, Compressed
Dataset Creation Property List
Dataset creation property list: information on how to organize data in storage.
[Figure: two storage layouts: chunked; chunked & compressed]
Steps to create a dataset in a file
1. Define dataset characteristics
   • Dataspace – 4x6
   • Datatype – float
   • Properties (if needed)
2. Decide where to put it – “root group”
   • Obtain location ID
3. Decide link or path – “A”
4. Create link and dataset in file
5. Close everything
[Figure: root group “/” containing dataset A]
Example: create a dataset

hid_t   file_id, dataset_id, dataspace_id;
hsize_t dims[2];
herr_t  status;

file_id = H5Fcreate("file.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

/* Create a dataspace: rank 2, current dims 4x6 */
dims[0] = 4;
dims[1] = 6;
dataspace_id = H5Screate_simple(2, dims, NULL);

/* Create a dataset: location, pathname, datatype, dataspace,
   property list (default). The datatype is a 32-bit IEEE float to match
   the 4x6 array of floats. This five-argument form is the HDF5 1.6
   signature; in 1.8 and later it is available as H5Dcreate1. */
dataset_id = H5Dcreate(file_id, "A", H5T_IEEE_F32BE, dataspace_id, H5P_DEFAULT);

/* Terminate access to dataset, dataspace, & file */
status = H5Dclose(dataset_id);
status = H5Sclose(dataspace_id);
status = H5Fclose(file_id);
Example: create this HDF5 file
[Figure: file.h5 with root group “/” containing dataset A, a 4x6 array of floats, and group B]
Creating a group
• To create a group, the calling program must:
  • Obtain location identifier where group is to be created
  • Create the group
  • Close the group
Creating a group
hid_t  file_id, group_id;  /* identifiers */
herr_t status;
...
/* Open "file.h5" */
file_id = H5Fopen("file.h5", H5F_ACC_RDWR, H5P_DEFAULT);

/* Create a group "/B" in file. This three-argument form is the HDF5 1.6
   signature; in 1.8 and later it is available as H5Gcreate1. */
group_id = H5Gcreate(file_id, "/B", 0);

/* Close the group and file. */
status = H5Gclose(group_id);
status = H5Fclose(file_id);
Questions?
End of Part I