HDF5 for efficient I/O
Dataset Details
A dataset object comprises data values and the metadata that describe them. A dataset has two fundamental parts: a header and a data array. The header contains information about the data array portion of the dataset and the associated metadata. Typical header information includes the name of the object, its dimensionality, its number type, information about how the data is stored on disk, and other information that HDF5 can use to speed up data access or improve data integrity.
The header has four essential classes of information: name, datatype, dataspace, and storage layout. A name in HDF5 is just a set of ASCII characters, but you should choose a name that says something meaningful about the dataset. A datatype in HDF5 describes the individual data elements in a dataset and falls into one of two categories: atomic or compound. Datatypes can be quite complicated to define, so I focus only on the basics. Some predefined datatypes [11] cover the data you are likely to encounter.
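As a rough illustration of those four classes of header information, the following sketch uses h5py to create a small dataset and read the header fields back. The file name, dataset name, and array shape are made up for the example, and the file lives only in memory (HDF5 core driver) so nothing is written to disk.

```python
import h5py
import numpy as np

# In-memory HDF5 file; "header_demo.h5" is only a placeholder name.
with h5py.File("header_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("temperature", data=np.zeros((3, 4), dtype="f8"))

    # Collect the header information while the file is still open.
    header = {
        "name": dset.name,        # full path of the object inside the file
        "datatype": dset.dtype,   # element type, reported as a NumPy dtype
        "shape": dset.shape,      # dataspace dimensions
        "rank": dset.ndim,        # number of dimensions
        "chunks": dset.chunks,    # storage layout: None means contiguous
    }

for key, value in header.items():
    print(f"{key}: {value}")
```

Because no chunking or compression was requested, h5py should report `None` for `chunks`, which corresponds to the default contiguous storage layout.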
Atomic datatypes include integers, floating-point numbers, and strings. Each datatype has a set of properties. For example, the integer datatype properties are size, order (endianness), and sign (signed/unsigned). The float datatype properties are size, location of the exponent and mantissa, and location of the sign bit.
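The integer properties mentioned above can be inspected from Python. This is a minimal sketch, assuming h5py's low-level datatype interface (`get_size`, `get_order`, and `get_sign` on the type identifier returned by `dset.id.get_type()`); the dataset name and the choice of a big-endian unsigned 16-bit integer are arbitrary.

```python
import h5py
import numpy as np

with h5py.File("atomic_demo.h5", "w", driver="core", backing_store=False) as f:
    # Big-endian, unsigned, 2-byte integer, chosen only for illustration.
    counts = f.create_dataset("counts", shape=(10,), dtype=np.dtype(">u2"))

    # Low-level handle to the HDF5 datatype stored in the dataset header.
    tid = counts.id.get_type()
    int_props = {
        "size": tid.get_size(),                                # bytes per element
        "big_endian": tid.get_order() == h5py.h5t.ORDER_BE,    # byte order
        "signed": tid.get_sign() == h5py.h5t.SGN_2,            # two's complement
    }

print(int_props)
```

The byte order is stored with the datatype in the file, so a reader on a machine with a different native order can still interpret the values correctly.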
Compound datatypes are collections of several datatypes that are presented as a single unit. In C, this is similar to a struct. The various parts of a compound datatype are called members and may be of any datatype, including another compound datatype. One of the fancy features of HDF5 is that you can read individual members of a compound datatype without reading the whole type.
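A compound datatype maps naturally onto a NumPy structured dtype in h5py. The sketch below uses invented member names (`time`, `sensor_id`, `temp`) and invented sample values; indexing the dataset with a member name asks HDF5 for that member alone.

```python
import h5py
import numpy as np

# Hypothetical record layout, comparable to a C struct with three members.
sample_t = np.dtype([("time", "f8"), ("sensor_id", "i4"), ("temp", "f4")])
records = np.array(
    [(0.0, 1, 20.5), (1.0, 1, 21.0), (2.0, 2, 19.5)],
    dtype=sample_t,
)

with h5py.File("compound_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("samples", data=records)

    member_names = dset.dtype.names  # the members of the compound type
    temps = dset["temp"]             # read just the "temp" member

print(member_names)
print(temps)
```

For wide records with many members, reading a single member this way can avoid transferring data you never look at.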
A dataspace describes the layout of a dataset's data elements and can consist of no elements (NULL), a single element (a scalar), or a simple array. The dataspace can be fixed or unlimited, which allows it to be extensible (i.e., it can grow larger).
Dataspace properties include rank (number of dimensions), size (the current extent of each dimension), and maximum size (the extent to which each dimension may grow). The rank of the dataspace is fixed when the array is created, whereas the size of each dimension can grow up to its maximum during the lifetime of the dataspace.
If you are not sure how large a dimension of your dataspace might become, you can always set its maximum size to the HDF5 predefined constant H5S_UNLIMITED.
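In h5py, an unlimited dimension is requested by passing `None` in `maxshape`, which h5py translates to the HDF5 unlimited constant. The following is a small sketch with a made-up dataset name and values; note that HDF5 requires chunked storage for extensible datasets, so a chunk shape is given explicitly.

```python
import h5py

with h5py.File("extend_demo.h5", "w", driver="core", backing_store=False) as f:
    # One-dimensional dataset: 4 elements now, no upper limit on growth.
    dset = f.create_dataset(
        "series", shape=(4,), maxshape=(None,), dtype="i4", chunks=(4,)
    )
    dset[:] = [1, 2, 3, 4]
    max_before = dset.maxshape   # None marks the unlimited dimension

    # Grow the dataset; the rank stays 1, only the size changes.
    dset.resize((8,))
    dset[4:] = [5, 6, 7, 8]

    final_shape = dset.shape
    final = dset[:]

print(max_before, final_shape, final)
```

The existing values survive the resize, and the new elements are filled in afterward.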
Attributes
One of the fundamental objects in HDF5 is an attribute, which is how you store metadata inside an HDF5 file. Attributes can optionally be associated with other HDF5 objects, such as groups, datasets, or named datatypes; they are not independent objects. As such, attributes are accessed by opening the object to which they are attached.
As the user, you define the attributes (make them meaningful), and you can delete or overwrite them as you see fit.
Attributes have two parts. The first is a name, and the second is a value. Typically, the value is a string that describes the data to which it is attached. Attributes can be extremely useful in a data file. With them, you can describe the data, including information such as when the data was collected, who collected it, what applications or sensors were used in its creation, a description (with as much information as you can include), and so on. A lack of useful metadata is one of the biggest problems in HPC data today, and attributes can help alleviate the problem. You just have to use them.
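Here is a brief sketch of creating, overwriting, and deleting attributes through h5py's `attrs` interface. All attribute names and values are placeholders chosen for the example, not a recommended schema.

```python
import h5py
import numpy as np

with h5py.File("attrs_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("temperature", data=np.arange(5, dtype="f4"))

    # Attributes are name/value pairs attached to an object.
    dset.attrs["units"] = "kelvin"
    dset.attrs["collected_by"] = "example operator"
    dset.attrs["description"] = "draft"

    # Overwrite an existing attribute by assigning to the same name.
    dset.attrs["description"] = "Hourly readings from a hypothetical sensor"

    # Delete an attribute that is no longer wanted.
    del dset.attrs["collected_by"]

    # Groups can carry attributes too; here the root group gets one.
    f.attrs["created"] = "2024-01-01"

    dataset_attrs = dict(dset.attrs)
    file_attrs = dict(f.attrs)

print(dataset_attrs)
print(file_attrs)
```

Because the attributes hang off the dataset and the root group, they are reached by opening those objects first, as the text describes.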
HDF5 Basics
In this section, I present a quick introduction to HDF5 through some simple code examples. The goal is not to dive deep into HDF5 but to illustrate the basics in practice. I'll start with Python because it is a widely used language, and the HDF5 Python library h5py [12] is easy to use and understand.