HDF5 for efficient I/O

h5py

The dominant Python interface to HDF5 is h5py. It is included with many Python distributions and with most Linux distributions. For the examples here, I use the Anaconda Python [13] distribution for Python 2.7.

The examples I use here are fairly simple and are derived from the Quick Start page [14] on the h5py website. The first example simply illustrates a few concepts, such as opening an HDF5 file for writing, creating datasets, and creating groups. The simple Python script in Listing 1 incorporates these concepts.

Listing 1

test.py

 
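01 #!/usr/bin/env python
02 #
03 # test.py: a minimal reconstruction derived from the h5py
04 # Quick Start [14]; the file name and values are illustrative.
05 #
06 import h5py
07 import numpy as np
08
09 # Open an HDF5 file for writing (an existing file is truncated)
10 f = h5py.File('mytestfile.hdf5', 'w')
11 dset = f.create_dataset('mydataset', (100,), dtype='i')
12 dset[...] = np.arange(100)
13 print dset.name
14 print dset.shape
15 print dset.dtype
16 print f.name
17 grp = f.create_group('subgroup')
18 dset2 = grp.create_dataset('another_dataset', (50,), dtype='f')
19
20 dset3 = f.create_dataset('subgroup2/dataset_three', (10,), dtype='i')
21 f.close()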

The first h5py command in line 10 opens a file for writing. If the file exists, it is overwritten; if it doesn't exist, it is created. Remember that HDF5 is really a container for data objects. When you create a file, the library creates a number of defaults, such as the root group (/), so the file is non-zero in size, even if no data or attributes are written into it.

After the file is opened and created, line 11 creates a dataset (mydataset) for 100 integers. At this point, only the object for the dataset (its dataspace) is created in the file; no data has been stored yet. Line 12 puts data into the data object using NumPy [15]. When you put data into the object or modify existing data, the h5py library takes care of updating the HDF5 file.

Recall that in Python almost everything is an object that has properties. Listing 1 prints the properties of the HDF5 file (line 16) and the first dataset (lines 13-15). Because HDF5 is object based, it fits well with the object nature of Python.

Line 17 creates a subgroup of the root group (subgroup); then, in line 18, a method of the group object creates a new dataset of 50 floating-point elements that resides in this subgroup. Line 20 creates a new dataset in a new subgroup named subgroup2; h5py creates the subgroup automatically if it doesn't exist. The output from this example Python script is shown in Listing 2.

Listing 2

Output of test.py

 
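/mydataset
(100,)
int32
/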

Notice the size of the integers: the dataset was created with a NumPy integer type that represents integers with 32 bits (int32).
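If you want a different size, you can request it explicitly when creating the dataset. For example, this illustrative call (not part of the original listing) asks for 64-bit integers:

dset = f.create_dataset('mydataset', (100,), dtype='i8')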

Another short Python script reads the HDF5 file and prints the names of the objects it contains. This can be done fairly easily with the h5py visit method (Listing 3), which walks the HDF5 file recursively, so you can discover all of the objects in the file, including groups and datasets. The output from the script is shown in Listing 4.

Listing 3

test2.py

 
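#!/usr/bin/env python
#
# test2.py: a minimal reconstruction; the file name is illustrative.
#
import h5py

f = h5py.File('mytestfile.hdf5', 'r')

# The root group acts like a dictionary: list its members
for name in f:
    print name

# visit() walks the file recursively, calling the function
# below once for every object (group or dataset) it finds
def printname(name):
    print name

f.visit(printname)
f.close()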

Listing 4

Output of test2.py

 
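mydataset
subgroup
subgroup2
mydataset
subgroup
subgroup/another_dataset
subgroup2
subgroup2/dataset_three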

You can find more information in the HDF5 documentation [16], and the Quick Start guide [14] has more examples of accessing HDF5 files from Python.

Fortran and HDF5

H5py is a Python-centric library that lets you use HDF5 in a very flexible manner, but compiled languages are a little different, so I also want to illustrate how to use HDF5 with a compiled language – in particular, Fortran. Using HDF5 with compiled languages is not quite as easy as with Python, but it is still not difficult: the developers of HDF5 provide a number of functions and subroutines for manipulating data and objects in an HDF5 file that make programming straightforward.

For this example, I use a CentOS 7.3 system with the default Fortran compiler (gfortran) and the HDF5 library that is part of the distribution. Building a Fortran executable with gfortran and the HDF5 library is straightforward; the generic command line below illustrates how,

$ gfortran code.f90 -fintrinsic-modules-path /usr/lib64/gfortran/modules -lhdf5_fortran -o exe

where code.f90 is the source file and exe is the resultant binary.

The HDF Group has provided some sample Fortran 90 code to get started, as well as more complex examples [17]. Listing 5 shows a Fortran 90 version of the first Python example (test.py).

Listing 5

Sample Fortran 90 Code

 

Notice that the code uses some predefined HDF5 variables that are necessary for using the library. Also note that this isn't "good" coding, in that the error variable is not checked after each subroutine call; the code is just an example, and I wanted to keep it short in the interest of space.
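In real code, you would check error after each call, along the lines of this hypothetical fragment:

CALL h5fcreate_f("test.h5", H5F_ACC_TRUNC_F, file_id, error)
IF (error /= 0) THEN
   WRITE(*,*) "h5fcreate_f failed"
   STOP
END IF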

The basic process of using HDF5 in Fortran is pretty logical. To begin, you initialize or enable the Fortran interface (line 57); then, you open a file (line 59) and start creating objects.

The first object to create is a dataset in the root group (/). Before you can do that, however, you have to create the dataspace (line 62) and then the dataset (line 64). Line 66 writes the data to the dataset. To reverse the process, first close the dataset (line 68) and then the dataspace (line 70).

The general approach for writing a dataset to an HDF5 file using Fortran is (1) create a dataspace, (2) create a dataset within the dataspace, (3) write the data to the dataset, (4) close the dataset, and (5) close the dataspace, as in the sketch below. You could easily wrap all of these steps in a Fortran 90 subroutine if you desired.
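To make the sequence concrete, the following condensed sketch (not the full listing; the file and dataset names are illustrative) writes a single integer dataset:

PROGRAM write_dset
  USE HDF5
  IMPLICIT NONE

  INTEGER(HID_T) :: file_id, dspace_id, dset_id
  INTEGER(HSIZE_T), DIMENSION(1) :: dims = (/ 100 /)
  INTEGER, DIMENSION(100) :: buf
  INTEGER :: i, error

  buf = (/ (i, i = 0, 99) /)

  CALL h5open_f(error)                       ! enable the Fortran interface
  CALL h5fcreate_f("test.h5", H5F_ACC_TRUNC_F, file_id, error)

  CALL h5screate_simple_f(1, dims, dspace_id, error)   ! (1) create the dataspace
  CALL h5dcreate_f(file_id, "mydataset", H5T_NATIVE_INTEGER, &
                   dspace_id, dset_id, error)          ! (2) create the dataset
  CALL h5dwrite_f(dset_id, H5T_NATIVE_INTEGER, buf, dims, error)  ! (3) write the data
  CALL h5dclose_f(dset_id, error)                      ! (4) close the dataset
  CALL h5sclose_f(dspace_id, error)                    ! (5) close the dataspace

  CALL h5fclose_f(file_id, error)
  CALL h5close_f(error)
END PROGRAM write_dset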

Interestingly, when using these subroutines, you have to use the full path to the group in which you are going to write the dataset, whereas with the h5py Python module, you can write to a group by using the method associated with the specific group.
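For example, with the Fortran interface, you create the group and then refer to the dataset by its full path, as in this illustrative fragment:

CALL h5gcreate_f(file_id, "subgroup", group_id, error)
CALL h5gclose_f(group_id, error)
CALL h5dcreate_f(file_id, "subgroup/another_dataset", H5T_NATIVE_REAL, &
                 dspace_id, dset_id, error)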

After running the Fortran code, which has no output, run the test2.py script from the Python section against the Fortran output:

$ ./test2.py
mydataset
subgroup
subgroup2
mydataset
subgroup
subgroup/another_dataset
subgroup2
subgroup2/dataset_three

If you compare this with the output from the Python code, you will see that they are the same.

HDF5 and Parallel I/O

A key feature of HDF5 is Parallel HDF5, which is included in the source code and enabled through a configure option. Several processes, either on the same node or on different nodes, can write to the same file at the same time. This capability can reduce the time an application spends on I/O, because all of the processes perform a portion of the I/O.
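If you build HDF5 from source, parallel support is typically enabled with something like:

$ CC=mpicc ./configure --enable-parallel
$ make
$ make install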

Amdahl's law says that your application will only go as fast as its serial portion allows. As an application runs over more processors, run time decreases; but as the number of processors, N, goes to infinity, the wall clock time approaches a constant [1]. Roughly, T(N) = t_serial + t_parallel/N, which approaches t_serial as N grows. In general, this constant can be thought of as the "serial time" of the application (i.e., the amount of time the application needs to run regardless of the number of processes used).

Many times, I/O is a significant portion of this serial time. If the I/O could be parallelized, the application could scale further, improving performance and allowing larger problems to be run. The great thing about Parallel HDF5 is that, behind the scenes, it uses MPI-IO [18]. A great deal of time has been spent designing and tuning MPI-IO for various filesystems and applications, which has resulted in very good parallel I/O performance from MPI applications.

Here, I explore using HDF5 for parallel I/O. The intent is to reduce the serial portion of I/O to improve scalability and performance. Before jumping knee-deep into HDF5 parallel I/O, I'll explore MPI processes writing to separate datasets in the same HDF5 file.
