HDF5 for efficient I/O
Fast Containers
HDF5 is a flexible, self-describing, and portable hierarchical data format supported by a number of languages and tools, with support for parallel I/O.
Input/output operations are an important part of many applications, sometimes involving huge amounts of data and large numbers of reads and writes. As a result, applications can spend a significant portion of their total run time performing I/O, which becomes critical in Big Data, machine learning, and high-performance computing (HPC).
In a previous article [1], I discussed options for improving I/O performance, focusing on parallel I/O. One of the options mentioned was to use a high-level library to perform the I/O. A great example of such a library is the Hierarchical Data Format (HDF) [2], a standard library used primarily for scientific computing.
In this article, I introduce HDF5 and focus on the concepts and its strengths in performing I/O; then, I look at some simple Python and Fortran code examples, before ending with an example of parallel I/O with HDF5 and Fortran.
What Is HDF5?
HDF5 is a freely available file format standard and set of tools for storing and organizing large amounts of data. It uses a filesystem-like data format familiar to anyone who has used a modern operating system – thus, the "hierarchical" portion of the name. You can store almost any data you want in an HDF5 file, including user-defined data types; integer, floating point, and string data; and binary data such as images, PDFs, and Excel spreadsheets. Files written in the HDF5 format are portable across operating systems and hardware (little endian and big endian).
HDF5 also allows metadata (attributes) to be associated with virtually any object in the data file. Metadata is the key to useful data files, and attributes make HDF5 files self-describing (e.g., like XML).
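To make this concrete, the following is a minimal sketch using h5py, the third-party Python binding mentioned later in this article. The filename, dataset name, and attribute names here are illustrative, not taken from a real application:

```python
# Minimal h5py sketch: write a dataset, then attach attributes
# (metadata) that make the file self-describing.
import h5py
import numpy as np

with h5py.File("weather.h5", "w") as f:
    temps = f.create_dataset("temperature", data=np.arange(10.0))
    # Attributes can hang off almost any object in the file
    temps.attrs["units"] = "celsius"
    temps.attrs["sensor"] = "rooftop-01"

# Any reader can now discover what the numbers mean
with h5py.File("weather.h5", "r") as f:
    print(f["temperature"].attrs["units"])  # celsius
```

A tool such as HDFView, or the `h5dump` utility shipped with the HDF5 library, would show the same attributes alongside the data.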
An example of how you could structure data within an HDF5 file is described in an online tutorial [3] that shows how to use HDFView [4] (Figure 1) to view an HDF5 file of hyperspectral remote sensing data. Notice how the temperature data falls under a hierarchy of directories. At the bottom of the viewer, the metadata associated with that data displays when you click on a data value (the temperature).
A number of tools and libraries let you use HDF5 from your favorite language. For example, C, C++, Fortran, and Java are officially supported with HDF5 tools, but third-party bindings (i.e., outside the official distribution) are also available for Python, Matlab, Octave, Scilab, Mathematica, R, Julia, Perl, Lua, Node.js, Erlang, Haskell, and others [5]-[9].
The HDF5 format can also accommodate data in row-major (C/C++, Mathematica) or column-major (Fortran, Matlab, Octave, Scilab, R, Julia, NumPy) order. The libraries from the HDF5 group are capable of compressing data within the file and even "chunking" the data into sub-blocks for storage. Chunking can result in faster access times for subsets of the data. Moreover, you can create lots of metadata to associate with data inside an HDF5 file.
Data in an HDF5 file can be accessed randomly, as in a database, so you don't have to read the entire file to access the data you want (unlike XML).
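A short sketch of both ideas, again with h5py and illustrative names: the dataset is stored in compressed 100x100 chunks, and the read at the end touches only the chunks covering the requested slice rather than the whole array:

```python
# Chunked, compressed storage plus partial (random-access) reads.
import h5py
import numpy as np

data = np.random.rand(1000, 1000)
with h5py.File("big.h5", "w") as f:
    f.create_dataset("matrix", data=data,
                     chunks=(100, 100),     # stored as 100x100 sub-blocks
                     compression="gzip")    # transparently compressed

with h5py.File("big.h5", "r") as f:
    # Only the chunks overlapping this slice are read and decompressed
    block = f["matrix"][0:10, 0:10]
print(block.shape)  # (10, 10)
```

Choosing a chunk shape that matches your typical access pattern (e.g., rows vs. columns) is what makes subset reads fast.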
One of the most interesting capabilities of HDF5 is parallel I/O [10]. However, you might have to build HDF5 with an MPI library that supports MPI-IO, a low-level interface for carrying out parallel I/O that gives you a great deal of flexibility but also requires a fair amount of coding. Parallel HDF5 is built on top of MPI-IO to remove most of the pain of parallel I/O.
Storing Data in HDF5
HDF5 comprises a file format for storing HDF data, a data model for organizing and accessing HDF5 data, and software consisting of libraries, language interfaces, and tools.
The file format is defined and published by the HDF Group. The HDF data model is fairly straightforward. Fundamentally, an HDF5 file is a container that holds data objects. Currently, eight objects either store data or help organize it:
- dataset
- group
- attribute
- file
- link
- datatype
- dataspace
- property list
Of these objects, datasets (multidimensional homogeneous arrays) and groups (container structures that can hold datasets and other groups) hold data, whereas the other objects are used for data organization.
Groups and Datasets
HDF5 groups are key to organizing data and are very similar to directories in a filesystem. Just like directories, you can organize data hierarchically so that the data layout is much easier to understand. With attributes (metadata), you can make groups even more useful than directories by adding descriptions.
HDF5 datasets are very similar to files in a filesystem. They hold data in the form of multidimensional arrays of elements. Data can be almost anything (e.g., images, tables, graphics, documents). As with groups, they also have metadata.
Every HDF5 file contains a "root" group that can contain other groups or datasets (files) or can be linked to other dataset objects in other portions of the HDF5 file. A root group is something like the root directory in a filesystem. In HDF5, it is referred to as /. If you write /foo, then foo is a member of the root group (it could be another group, a dataset, or a link to other files), and the notation looks something like a "path" in a filesystem.
Figure 2 shows a theoretical HDF5 file layout. The root group (/) has three subgroups: A, B, and C. The paths /A/temp and /C/temp point to different datasets. However, Figure 2 also shows that datasets can be shared (like a symbolic link in filesystems): /A/k and /B/m point to the same object (a dataset).
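The layout in Figure 2 can be roughly recreated in a few lines of h5py. The group and dataset names mirror the figure; the data values are made up for illustration:

```python
# Rebuild the Figure 2 layout: groups /A, /B, /C, two distinct
# "temp" datasets, and a hard link so /A/k and /B/m share one object.
import h5py
import numpy as np

with h5py.File("layout.h5", "w") as f:
    a = f.create_group("A")
    b = f.create_group("B")
    c = f.create_group("C")
    a.create_dataset("temp", data=np.array([20.5, 21.0]))
    c.create_dataset("temp", data=np.array([15.2, 14.8]))  # different dataset
    k = a.create_dataset("k", data=np.arange(5))
    b["m"] = k  # hard link: /B/m refers to the same dataset as /A/k

with h5py.File("layout.h5", "r") as f:
    # Both paths resolve to the same underlying data
    print((f["/A/k"][:] == f["/B/m"][:]).all())  # True
```

Assigning an existing dataset object to a new path (`b["m"] = k`) creates a hard link, which is how HDF5 shares one object under two names.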
Groups and datasets are the two most fundamental object types in HDF5. If you can remember them and use them, then you can start writing and reading HDF5 files in your applications.