HDF5 for efficient I/O
Fast Containers

Lead Image © Kirsty Pargeter, 123RF.com
HDF5 is a flexible, self-describing, and portable hierarchical data format supported by a number of languages and tools, with support for parallel I/O.
Input/output operations are an important part of many applications, sometimes involving huge amounts of data and large numbers of reads and writes. Applications can therefore spend a significant portion of their total run time performing I/O, which becomes critical in Big Data, machine learning, and high-performance computing (HPC).
In a previous article [1], I discussed options for improving I/O performance, focusing on parallel I/O. One of the options mentioned was to use a high-level library to perform the I/O. A great example of such a library is the Hierarchical Data Format (HDF) [2], a standard library used primarily for scientific computing.
In this article, I introduce HDF5 and focus on the concepts and its strengths in performing I/O; then, I look at some simple Python and Fortran code examples, before ending with an example of parallel I/O with HDF5 and Fortran.
What Is HDF5?
HDF5 is a freely available file format standard and set of tools for storing and organizing large amounts of data. It uses a filesystem-like data format familiar to anyone who has used a modern operating system – thus, the "hierarchical" portion of the name. You can store almost any data you want in an HDF5 file, including user-defined data types; integer, floating point, and string data; and binary data such as images, PDFs, and Excel spreadsheets. Files written in the HDF5 format are portable across operating systems and hardware (little endian and big endian).
HDF5 also allows metadata (attributes) to be associated with virtually any object in the data file. Metadata is the key to useful data files, and attributes make HDF5 files self-describing (e.g., like XML).
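As a minimal sketch of how attributes make a file self-describing, the following uses the h5py package (one of the third-party Python bindings for HDF5); the file name, dataset name, and attribute keys are hypothetical:

```python
import h5py
import numpy as np

# Write a dataset and attach metadata (attributes) to it
with h5py.File("weather.h5", "w") as f:
    dset = f.create_dataset("temperature", data=np.random.rand(100))
    dset.attrs["units"] = "Celsius"          # attributes are key/value pairs
    dset.attrs["instrument"] = "thermocouple-3"

# Any reader can later discover what the numbers mean
with h5py.File("weather.h5", "r") as f:
    units = f["temperature"].attrs["units"]
print(units)
```

The attributes travel inside the file itself, so no separate documentation is needed to interpret the data.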
An example of how you could structure data within an HDF5 file is described in an online tutorial [3] that shows how to use HDFView [4] (Figure 1) to view an HDF5 file of hyperspectral remote sensing data. Notice how the temperature data falls under a hierarchy of directories. At the bottom of the viewer, the metadata associated with that data displays when you click on a data value (the temperature).
A number of tools and libraries let you use HDF5 from your favorite language. For example, C, C++, Fortran, and Java are officially supported with HDF5 tools, but third-party bindings (i.e., outside the official distribution) are also available for Python, Matlab, Octave, Scilab, Mathematica, R, Julia, Perl, Lua, Node.js, Erlang, Haskell, and others [5]-[9].
The HDF5 format can also accommodate data in row-major (C/C++, NumPy, Mathematica) or column-major (Fortran, Matlab, Octave, Scilab, R, Julia) order. The libraries from the HDF Group are capable of compressing data within the file and even "chunking" the data into sub-blocks for storage. Chunking can result in faster access times for subsets of the data. Moreover, you can create lots of metadata to associate with data inside an HDF5 file.
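A short sketch of chunking and compression with h5py follows; the file name, dataset name, chunk shape, and gzip level are illustrative choices, not requirements:

```python
import h5py
import numpy as np

data = np.arange(1_000_000, dtype=np.float64).reshape(1000, 1000)

# Store the array in 100x100 chunks, compressing each chunk with gzip level 4
with h5py.File("chunked.h5", "w") as f:
    f.create_dataset("grid", data=data, chunks=(100, 100),
                     compression="gzip", compression_opts=4)

# The storage layout is recorded in the file and visible on read
with h5py.File("chunked.h5", "r") as f:
    d = f["grid"]
    chunk_shape, comp = d.chunks, d.compression
print(chunk_shape, comp)
```

Because each chunk is compressed and addressed independently, reading a small region of the array touches only the chunks that overlap it.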
Data in an HDF5 file can be accessed randomly, as in a database, so you don't have to read the entire file to access the data you want (unlike XML).
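To illustrate this random access, the sketch below slices ten elements out of a larger dataset with h5py; the library reads only the requested region rather than the whole file (names are hypothetical):

```python
import h5py
import numpy as np

# Create a dataset of 10,000 values
with h5py.File("big.h5", "w") as f:
    f.create_dataset("series", data=np.arange(10_000))

# Read only elements 5000-5009; HDF5 fetches just that region from disk
with h5py.File("big.h5", "r") as f:
    window = f["series"][5000:5010]
print(window)
```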
One of the most interesting capabilities of HDF5 is parallel I/O [10]. However, you might have to build HDF5 with an MPI library that supports MPI-IO, a low-level interface for carrying out parallel I/O that gives you a great deal of flexibility but also requires a fair amount of coding. Parallel HDF5 is built on top of MPI-IO to remove most of the pain of parallel I/O.
Storing Data in HDF5
HDF5 comprises a file format for storing HDF data, a data model for organizing and accessing HDF5 data, and the software itself: libraries, language interfaces, and tools.
The file format is defined and published by the HDF Group. The HDF data model is fairly straightforward. Fundamentally, an HDF5 file is a container that holds data objects. Currently, eight objects either store data or help organize it:
- file
- group
- dataset
- attribute
- link
- datatype
- dataspace
- property list
Of these objects, datasets (multidimensional homogeneous arrays) and groups (container structures that can hold datasets and other groups) hold data, whereas the other objects are used for data organization.
Groups and Datasets
HDF5 groups are key to organizing data and are very similar to directories in a filesystem. Just like directories, you can organize data hierarchically so that the data layout is much easier to understand. With attributes (metadata), you can make groups even more useful than directories by adding descriptions.
HDF5 datasets are very similar to files in a filesystem. They hold data in the form of multidimensional arrays of elements. Data can be almost anything (e.g., images, tables, graphics, documents). As with groups, they also have metadata.
Every HDF5 file contains a "root" group that can contain other groups or datasets (files) or can be linked to dataset objects in other portions of the HDF5 file. A root group is something like the root directory in a filesystem. In HDF5, it is referred to as /. If you write /foo, then foo is a member of the root group (it could be another group, a dataset, or a link to other files), and the notation looks something like a "path" in a filesystem.
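The path-like layout can be sketched with h5py as follows; the group and dataset names (/foo, bar) mirror the example above and are otherwise arbitrary:

```python
import h5py
import numpy as np

# Create a group /foo under the root group and a dataset inside it
with h5py.File("layout.h5", "w") as f:
    grp = f.create_group("/foo")            # member of the root group /
    grp.create_dataset("bar", data=np.zeros(4))

# Objects are addressed by filesystem-style paths
with h5py.File("layout.h5", "r") as f:
    members = list(f["/"].keys())
    has_path = "/foo/bar" in f
print(members, has_path)
```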
Figure 2 shows a theoretical HDF5 file layout. The root group (/) has three subgroups: A, B, and C. Groups /A/temp and /C/temp point to different datasets. However, Figure 2 also shows that datasets can be shared (like a symbolic link in a filesystem): /A/k and /B/m point to the same object (a dataset).
Groups and datasets are the two most fundamental object types in HDF5. If you can remember them and use them, then you can start writing and reading HDF5 files in your applications.
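The shared-dataset arrangement from Figure 2 can be sketched with h5py using a hard link, so that /A/k and /B/m name the same underlying object; writing through one path is visible through the other (the file name and values are illustrative):

```python
import h5py
import numpy as np

with h5py.File("shared.h5", "w") as f:
    a = f.create_group("A")
    b = f.create_group("B")
    a.create_dataset("k", data=np.arange(3))
    b["m"] = a["k"]          # hard link: /B/m refers to the same dataset as /A/k

# Modify the dataset through one path...
with h5py.File("shared.h5", "a") as f:
    f["/A/k"][0] = 42

# ...and observe the change through the other
with h5py.File("shared.h5", "r") as f:
    shared_value = int(f["/B/m"][0])
print(shared_value)
```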