HDF5 for efficient I/O

Fast Containers

© Lead Image © Kirsty Pargeter, 123RF.com

Article from Issue 205/2017

HDF5 is a flexible, self-describing, and portable hierarchical data format supported by a number of languages and tools, with support for parallel I/O.

Input/output operations are an important part of many applications, sometimes involving huge amounts of data and large numbers of reads and writes. As a result, an application can spend a significant portion of its total run time performing I/O, which becomes critical in Big Data, machine learning, and high-performance computing (HPC).

In a previous article [1], I discussed options for improving I/O performance, focusing on parallel I/O. One of the options mentioned was to use a high-level library to perform the I/O. A great example of such a library is the Hierarchical Data Format (HDF) [2], a standard library used primarily for scientific computing.

In this article, I introduce HDF5 and focus on its concepts and its strengths in performing I/O; then, I look at some simple Python and Fortran code examples, before ending with an example of parallel I/O with HDF5 and Fortran.

What Is HDF5?

HDF5 is a freely available file format standard and set of tools for storing and organizing large amounts of data. It uses a filesystem-like data format familiar to anyone who has used a modern operating system – thus, the "hierarchical" portion of the name. You can store almost any data you want in an HDF5 file, including user-defined data types; integer, floating point, and string data; and binary data such as images, PDFs, and Excel spreadsheets. Files written in the HDF5 format are portable across operating systems and hardware (little endian and big endian).
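To make this concrete, here is a minimal sketch using h5py (one of the third-party Python bindings discussed later) that stores integer, floating-point, string, and opaque binary data in a single HDF5 file; the file and dataset names are arbitrary choices for illustration:

```python
# Minimal sketch: mixing data types in one HDF5 file with h5py.
import numpy as np
import h5py

with h5py.File("example.h5", "w") as f:
    f["counts"] = np.arange(10, dtype=np.int64)   # integer array
    f["temps"] = np.array([21.5, 22.1, 19.8])     # floating-point array
    f["label"] = "hyperspectral scan"             # variable-length string
    f["blob"] = np.void(b"\x89PNG...")            # opaque binary blob

with h5py.File("example.h5", "r") as f:
    print(list(f.keys()))                         # the datasets in the file
```

Each assignment to `f[...]` creates a new dataset; the library records the type and shape of each one, which is what makes the file self-describing.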

HDF5 also allows metadata (attributes) to be associated with virtually any object in the data file. Metadata is the key to useful data files, and attributes make HDF5 files self-describing (e.g., like XML).
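As a sketch of how attributes work in h5py, the following attaches illustrative metadata (the attribute names and values are my own assumptions, not a standard) to both a dataset and the file itself:

```python
# Sketch: attaching metadata (attributes) to HDF5 objects with h5py.
import numpy as np
import h5py

with h5py.File("attrs.h5", "w") as f:
    dset = f.create_dataset("temperature", data=np.array([15.0, 16.2, 14.8]))
    dset.attrs["units"] = "degrees Celsius"    # describe the dataset
    dset.attrs["sensor"] = "thermistor-3"      # hypothetical instrument name
    f.attrs["created_by"] = "survey pipeline"  # files and groups take attributes too
```

A reader who opens this file later can discover the units and provenance of the data without any external documentation.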

An example of how you could structure data within an HDF5 file is described in an online tutorial [3] that shows how to use HDFView [4] (Figure 1) to view an HDF5 file of hyperspectral remote sensing data. Notice how the temperature data falls under a hierarchy of directories. At the bottom of the viewer, the metadata associated with that data displays when you click on a data value (the temperature).

Figure 1: Example of the HDF5 data hierarchy in HDFView.

A number of tools and libraries let you use HDF5 from your favorite language. For example, C, C++, Fortran, and Java are officially supported with HDF5 tools, but third-party bindings (i.e., outside the official distribution) are also available for Python, Matlab, Octave, Scilab, Mathematica, R, Julia, Perl, Lua, Node.js, Erlang, Haskell, and others [5]-[9].

The HDF5 format can also accommodate data in row-major (C/C++, Mathematica) or column-major (Fortran, Matlab, Octave, Scilab, R, Julia, NumPy) order. The libraries from the HDF5 group are capable of compressing data within the file and even "chunking" the data into sub-blocks for storage. Chunking can result in faster access times for subsets of the data. Moreover, you can create lots of metadata to associate with data inside an HDF5 file.
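The chunking and compression options can be sketched with h5py as follows; the array shape, chunk size, and gzip level here are arbitrary choices for illustration:

```python
# Sketch: a chunked, gzip-compressed dataset with h5py.
import numpy as np
import h5py

data = np.random.rand(1000, 1000)
with h5py.File("chunked.h5", "w") as f:
    dset = f.create_dataset("grid", data=data,
                            chunks=(100, 100),   # store as 100x100 sub-blocks
                            compression="gzip",
                            compression_opts=4)  # gzip level 4
    print(dset.chunks, dset.compression)
```

Because the data is stored chunk by chunk, reading a 100x100 corner of the array touches only one compressed block on disk instead of the whole dataset.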

Data in an HDF5 file can be accessed randomly, as in a database, so you don't have to read the entire file to access the data you want (unlike XML).
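A short sketch of that random access with h5py: slicing a dataset reads only the requested elements from disk, not the million-element array around them (the sizes here are arbitrary):

```python
# Sketch: partial (random) reads from a large HDF5 dataset with h5py.
import numpy as np
import h5py

with h5py.File("slices.h5", "w") as f:
    f["series"] = np.arange(1_000_000)

with h5py.File("slices.h5", "r") as f:
    window = f["series"][500:510]   # only these 10 elements are read
    print(window)
```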

One of the most interesting capabilities of HDF5 is parallel I/O [10]. To use it, you have to build HDF5 against an MPI library that supports MPI-IO, a low-level interface for carrying out parallel I/O that gives you a great deal of flexibility but also requires a fair amount of coding. Parallel HDF5 is built on top of MPI-IO to remove most of the pain of parallel I/O.

Storing Data in HDF5

HDF5 comprises a file format for storing HDF data, a data model for organizing and accessing HDF5 data, and the software itself: libraries, language interfaces, and tools.

The file format is defined and published by the HDF Group. The HDF data model is fairly straightforward. Fundamentally, an HDF5 file is a container that holds data objects. Currently, eight objects either store data or help organize it:

  • dataset
  • group
  • attribute
  • file
  • link
  • datatype
  • dataspace
  • property list

Of these objects, datasets (multidimensional homogeneous arrays) and groups (container structures that can hold datasets and other groups) hold data, whereas the other objects are used for data organization.

Groups and Datasets

HDF5 groups are key to organizing data and are very similar to directories in a filesystem. Just like directories, you can organize data hierarchically so that the data layout is much easier to understand. With attributes (metadata), you can make groups even more useful than directories by adding descriptions.

HDF5 datasets are very similar to files in a filesystem. They hold data in the form of multidimensional arrays of elements. Data can be almost anything (e.g., images, tables, graphics, documents). As with groups, they also have metadata.

Every HDF5 file contains a "root" group that can contain other groups or datasets (files) or can be linked to other dataset objects in other portions of the HDF5 file. A root group is something like the root directory in a filesystem. In HDF5, it is referred to as /. If you write /foo, then foo is a member of the root group (it could be another group, a dataset, or a link to other files), and the notation looks something like a "path" in a filesystem.

Figure 2 shows a theoretical HDF5 file layout. The root group (/) has three subgroups: A, B, and C. The paths /A/temp and /C/temp point to different datasets. However, Figure 2 also shows that datasets can be shared (like a symbolic link in filesystems): /A/k and /B/m point to the same object (a dataset).
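A layout like the one in Figure 2 can be sketched in h5py as follows; the dataset contents are arbitrary, and the link from /B/m to the object behind /A/k is created by assigning an existing dataset to a new name (a hard link in HDF5 terms):

```python
# Sketch of a Figure 2-style layout: groups /A, /B, /C, two distinct
# "temp" datasets, and /B/m linked to the same object as /A/k.
import numpy as np
import h5py

with h5py.File("layout.h5", "w") as f:
    for name in ("A", "B", "C"):
        f.create_group(name)
    f["/A/temp"] = np.array([1.0, 2.0])   # one dataset under /A
    f["/C/temp"] = np.array([3.0, 4.0])   # a different dataset under /C
    f["/A/k"] = np.zeros(5)
    f["/B/m"] = f["/A/k"]                 # second path to the same object

with h5py.File("layout.h5", "r+") as f:
    f["/A/k"][0] = 42.0
    print(f["/B/m"][0])                   # the shared object sees the change
```

Writing through one path and reading through the other shows that both names refer to a single underlying dataset.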

Figure 2: Theoretical HDF5 layout.

Groups and datasets are the two most fundamental object types in HDF5. If you can remember them and use them, then you can start writing and reading HDF5 files in your applications.
