Python and HDF5 Andrew Collette University of Colorado
May 13, 2015
What makes scientific data special?
It’s meant to be shared - collaborative
Ad-hoc or changing structure - flexible
Archived and preserved - robust
Python and HDF5 together address all three
High-level language
Almost no “boilerplate” code
“Exception” error handling
Fully object-oriented
First-class module/namespace support
Readable
Self-documenting
Free
(the language)
(the platform)
Python itself is “batteries included”
Mature numerical, plotting and scientific modules
Hundreds of specialized science packages
Thousands more general-purpose
Core analysis packages
NumPy - Array objects and basic operations
SciPy - Advanced science & engineering library
Matplotlib - Publication-quality plots (both rendered and interactive)
Thousands of others
Unit testing - unittest module in stdlib
Only need to write code for your problem
Web servers and development - literally hundreds
Interface: F2PY (Fortran), Cython (C), ctypes, others
Distribution - distutils/pip single-command installs
Python highlights
Readable
Iteration
C
IDL
Python
Speed
FFTs and optimized routines built in to NumPy/SciPy
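As a rough illustration of the built-in FFT routines, the sketch below finds the dominant frequency of a synthetic signal without any explicit Python loop. The 50 Hz sine and the 1 kHz sample rate are assumed values chosen for the example, not taken from the talk.

```python
import numpy as np

# Assumed test signal: a 50 Hz sine sampled at 1 kHz for one second.
sample_rate = 1000.0                  # samples per second
t = np.arange(1000) / sample_rate     # 1000 time points
signal = np.sin(2 * np.pi * 50.0 * t)

# Real-input FFT and the matching frequency axis, both vectorized.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(signal.size, d=1.0 / sample_rate)

# The strongest bin should sit at the frequency of the sine.
peak_freq = freqs[np.argmax(np.abs(spectrum))]
print(peak_freq)
```

The array operations run in compiled code inside NumPy, which is where the speed comes from.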
ctypes and Cython
ctypes
Advanced foreign function interface
Call C libraries from pure Python code
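A minimal sketch of calling a C library from pure Python with ctypes. It assumes a POSIX system (Linux or Mac) where the C standard library can be located. The choice of strlen is only an illustration and does not come from the talk.

```python
import ctypes
import ctypes.util

# Locate the C standard library. If find_library returns None, CDLL(None)
# on POSIX systems exposes the symbols already loaded into the running
# process, which include libc.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

length = libc.strlen(b"HDF5")
print(length)
```

Declaring argtypes and restype is optional for simple cases, but it lets ctypes check the arguments and convert the return value correctly.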
Cython
Example from the HDF5 C Library:
HDF5
Hierarchical Data Format
File specification and object model
C library
Ecosystem of users and developers
3 things:
Objects
Datasets: homogeneous arrays of data
Groups: containers holding datasets and groups
Attributes: arbitrary metadata on groups & datasets
Standard constructs using these, or make your own!
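The three object types can be sketched with h5py as follows. The file name, group name, dataset name and attribute values are all hypothetical, chosen only to show one group, one dataset and two attributes.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file name, written to a scratch directory.
path = os.path.join(tempfile.mkdtemp(), "objects_demo.h5")

with h5py.File(path, "w") as f:
    # Group: a container, much like a directory.
    run = f.create_group("run_001")

    # Dataset: a homogeneous array stored inside the group.
    temps = run.create_dataset("temperature", data=np.linspace(20.0, 25.0, 6))

    # Attributes: small pieces of metadata attached to either kind of object.
    temps.attrs["units"] = "degC"
    run.attrs["operator"] = "example user"

# Read everything back by name.
with h5py.File(path, "r") as f:
    dset = f["run_001/temperature"]
    shape = dset.shape
    units = dset.attrs["units"]
    operator = f["run_001"].attrs["operator"]

print(shape, units, operator)
```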
Dataset features
Partial I/O: read and write just what you want
Automatic type conversion
On-the-fly compression
(In Python, we even use the array-access syntax!)
Parallel reads & writes with MPI
(Directly from Python!)
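A sketch of partial I/O, type conversion and compression in h5py. Parallel MPI access is left out because it needs an MPI-enabled build. The dataset name, shape and chunk size are assumptions made for the example.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "features_demo.h5")
values = np.arange(1000 * 100, dtype="int32").reshape(1000, 100)

with h5py.File(path, "w") as f:
    # On-the-fly compression: gzip is applied chunk by chunk on write.
    dset = f.create_dataset(
        "big", data=values, chunks=(100, 100), compression="gzip"
    )
    # Partial write: only the selected elements are touched.
    dset[0, :10] = -1

with h5py.File(path, "r") as f:
    dset = f["big"]
    compression = dset.compression

    # Partial read with array-access syntax: only this block is loaded.
    block = dset[10:12, 0:5]

    # Type conversion: HDF5 converts the stored int32 values while
    # reading them into a float64 buffer.
    as_float = np.empty((2, 5), dtype="float64")
    dset.read_direct(as_float, source_sel=np.s_[10:12, 0:5])

    first_row_start = dset[0, :10]

print(block)
print(as_float.dtype)
```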
Metadata & Organization
Groups form a POSIX-style “filesystem” in the file
Attributes can store arbitrary data on arbitrary objects
How should the file be organized?
You decide!
Thousands of domain-specific “application formats”
Anyone can read them because HDF5 is self-describing!
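To show what self-describing means in practice, the sketch below builds a file with an invented layout and then discovers that layout with no prior knowledge of it. The group names, dataset names and attribute text are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "layout_demo.h5")

# Build a small file with an invented layout. Intermediate groups in a
# POSIX-style path are created automatically.
with h5py.File(path, "w") as f:
    f.create_dataset("/experiment/probe_a/voltage", data=np.zeros(4))
    f.create_dataset("/experiment/probe_b/voltage", data=np.ones(4))
    f["/experiment"].attrs["description"] = "hypothetical two-probe run"

# Walk the file: record every object's path, kind and attributes.
found = []

def record(name, obj):
    kind = "dataset" if isinstance(obj, h5py.Dataset) else "group"
    found.append((name, kind, dict(obj.attrs)))

with h5py.File(path, "r") as f:
    f.visititems(record)

for name, kind, attrs in found:
    print(name, kind, attrs)
```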
Example
Open an HDF5 file
Extract a particular dataset
Read the data
Make an interactive plot
Close the file
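The five steps above can be sketched with h5py and Matplotlib. The file is a stand-in created on the spot so the example runs anywhere. Its name and the dataset path "signals/trace" are hypothetical.

```python
import os
import tempfile

import h5py
import matplotlib.pyplot as plt
import numpy as np

# Stand-in file (hypothetical names) so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.h5")
with h5py.File(path, "w") as out:
    out.create_dataset(
        "signals/trace", data=np.sin(np.linspace(0, 2 * np.pi, 200))
    )

f = h5py.File(path, "r")      # 1. Open an HDF5 file
dset = f["signals/trace"]     # 2. Extract a particular dataset
data = dset[...]              # 3. Read the data into a NumPy array

plt.ion()                     # 4. Make an interactive plot
(line,) = plt.plot(data)
plt.title("signals/trace")

f.close()                     # 5. Close the file
```

Because the data was copied into a NumPy array in step 3, the plot stays usable after the file is closed.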
Demo
Real-world use
UCLA Large Plasma Device
Image credit: Basic Plasma Science Facility
Laser Experiment
Image credit: Basic Plasma Science Facility
LAPD Data Products
Acquisition file - “Planes” of data in HDF5
Metadata: timestamps, digitizer settings, probe positions, background plasma conditions…
Packaged into HDF5 following “lab layout”
Users take their data back home and analyze
Visualization
Python 2D plotting
A. Collette et al. Phys. Rev. Lett. 105, 195003 (2010)
Only 160 lines of code!
Python does 3D too!
“MayaVi” 3D visualizer
Development sponsored by Enthought
Both offline (scripted) and interactive modes
A. Collette et al. Phys. Plasmas 18, 055705 (2011)
CU Accelerator
Raw data
HDF5 Shot file
Automated speed/mass calculation
MySQL
Data search
HDF5 file for user
Where to get Python
Distributions are the best way to get started
(they include HDF5/h5py!)
Anaconda (Windows, Mac, Linux): http://continuum.io
PythonXY (Windows): http://pythonxy.googlecode.com
Questions?