# HDF5 Datasets in ILNumerics®

Datasets are the main data storage in HDF5 files. ILNumerics Array<T> and HDF5 datasets correspond closely. Both share the following properties:

• rectilinear shape
• arbitrary number of dimensions (currently limited to 32 by HDF5)
• expandable
• arbitrary element type
• ability to store arrays of arrays (Cells in ILNumerics)

In ILNumerics, H5Dataset serves as a proxy object for an existing dataset in an HDF5 file. The data elements are stored only on the HDF5 side, not in the H5Dataset object. Therefore, there is no need to explicitly free or dispose of H5Dataset objects after use (the same holds for all H5Objects in ILNumerics except H5File). H5Dataset exposes the properties of the underlying dataset, transfers data to and from the dataset efficiently in terms of Array<T>, and supports creating and managing datasets within an HDF5 file.

## Contents

Dataset Creation - describes how to create datasets using the ILNumerics API

Dataset Configuration - ways to configure HDF5 datasets, including fill values and compression

Reading from Datasets - how to read from HDF5 datasets using the ILNumerics API

Writing to Datasets - how to write to HDF5 datasets with ILNumerics

Supported Datatypes - element types that are supported by ILNumerics and the way they map to HDF5 native datatypes

## Dataset Creation

New datasets in ILNumerics HDF5 files are created by using the new operator (New in Visual Basic). A name and initial data are the only required parameters. The dataset will be persisted in the HDF5 file only after being added to an existing group.

The name of the dataset must be unique among all objects in a group. Trying to add a dataset to a group that already contains an object with the same name raises an error. Names for datasets can contain alphanumeric characters and underscores.

The initial data given to the dataset at the time of creation deserves some attention. Datasets in ILNumerics are always created chunked. 'Chunks' are a feature of the HDF5 library which, among other advantages, allows the dataset to be expanded by adding new data after it has been created. The dataset grows in chunks of equal size. The chunk size is set once, when the dataset is created, and cannot be changed afterwards. In ILNumerics, the initial data given to the dataset in the constructor determines the chunk size for the dataset.
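
To make the effect of a fixed chunk size concrete, the following plain C# sketch computes how many chunks are needed along each dimension once a dataset has grown. It does not call the ILNumerics or HDF5 API. `ChunkMath` and `ChunksPerDimension` are hypothetical names, and the sketch assumes that the chunk shape equals the shape of the initial data, which the text above suggests but does not spell out.

```csharp
using System;

public static class ChunkMath
{
    // Number of chunks needed along each dimension to cover a dataset of the
    // given size when every chunk has the shape chunkShape.
    public static long[] ChunksPerDimension(long[] datasetSize, long[] chunkShape)
    {
        if (datasetSize.Length != chunkShape.Length)
            throw new ArgumentException("Rank mismatch between dataset and chunk.");

        var counts = new long[datasetSize.Length];
        for (int d = 0; d < datasetSize.Length; d++)
        {
            if (chunkShape[d] <= 0)
                throw new ArgumentException("Chunk extents must be positive.");
            // Ceiling division: a partially filled chunk still occupies a whole chunk.
            counts[d] = (datasetSize[d] + chunkShape[d] - 1) / chunkShape[d];
        }
        return counts;
    }

    public static void Main()
    {
        // Assumption: initial data of shape 100x200 became the chunk shape;
        // the dataset was later grown to 250x200.
        long[] counts = ChunksPerDimension(new long[] { 250, 200 }, new long[] { 100, 200 });
        Console.WriteLine(string.Join(" x ", counts)); // prints "3 x 1"
    }
}
```

Because the chunk shape cannot be changed later, very small initial data leads to many small chunks when the dataset grows.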

## Dataset Configuration

Several optional parameters in the constructor of H5Dataset allow the configuration of storage properties for a dataset. Those properties are valid for the whole lifetime of the dataset; they cannot be changed once the dataset exists.

### Compression

ILNumerics currently supports two types of compression for datasets: Deflate (gzip compression) and SZIP compression.

The type of compression is set by the compression parameter in the constructor of H5Dataset. A second parameter 'compParameter' allows further configuration of the compression method.

For the deflate compression method, compParameter defines the level of compression. A level of 0 means no compression; 9 gives the strongest compression and is also the slowest variant.

For SZIP compression parameters, consult the online manual of the HDF5 SZIP API.

The configured values for the compression and the compression parameter are retrieved via the properties H5Dataset.Compression and H5Dataset.CompParameter.

#### Important Note about SZIP Licensing

The license for the SZIP algorithm included in ILNumerics allows the decompression of data without limitation, including for commercial use. The license of the compression encoder, however, does not allow commercial use. This means you can use ILNumerics to read arbitrary SZIP compressed data, but in order to write your own SZIP compressed datasets you will have to acquire an individual license for SZIP compression.

Further details:

http://www.hdfgroup.org/doc_resource/SZIP

### Fill Value

When new data is written to existing datasets, the dataset is automatically expanded if necessary. This may allocate storage areas in the dataset where no element values are defined. HDF5 automatically assigns 0 (zero) by default for most basic scalar datatypes. This 'FillValue' can be configured with a custom value.

As with compression, the fill value must be set in the constructor of H5Dataset. It can be read from existing datasets but cannot be changed after creation. To read the fill value, specify the datatype T in which the value is to be returned: H5Dataset.GetFillValue<T>() retrieves the configured fill value.

### Dataset Creation Example

The following example creates a new dataset and configures a custom fill value as well as deflate (gzip) compression for it. It then expands the dataset by assigning a new value to an element position outside of the current dimension limits.

### Size

The size of a dataset is retrieved by the H5Dataset.Size property. It cannot be set directly. The size changes only as a side effect of the H5Dataset.Set(A,range) function, which is used to alter existing datasets.

## Reading from Datasets

The H5Dataset.Get<T>(range) function reads arrays stored as datasets into ILNumerics arrays. The function optionally accepts a subrange. If the range is omitted, the whole dataset is retrieved and copied to a new array of the given element type T.

Note that the type parameter T for H5Dataset.Get<T>() does not need to match the element type of the data actually stored in the file. In the example above, we have created a dataset for double precision elements. In the subsequent access via H5Dataset.Get<T>() we used single precision float for T. The operation did not produce an error but returned the correct values, converted to single precision. The HDF5 library handles all necessary conversions for us. The element type T only needs to be compatible with the actual element type of the elements stored in the file. Conversions exist for all common system-defined scalar numeric value types and their combinations.
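
"Converted to single precision" means that values are correct up to the precision of the requested type. The following plain C# sketch models this with an ordinary cast. It does not call ILNumerics or HDF5, `PrecisionDemo` and `ReadAsSingle` are hypothetical names, and it assumes the library conversion rounds like a C# cast from double to float.

```csharp
using System;

public static class PrecisionDemo
{
    // Models reading double precision elements as single precision:
    // every value is rounded to the nearest representable float.
    public static float[] ReadAsSingle(double[] stored)
        => Array.ConvertAll(stored, v => (float)v);

    public static void Main()
    {
        double[] stored = { 1.0, 0.1, 123456789.0 };
        float[] read = ReadAsSingle(stored);

        Console.WriteLine(read[0] == 1.0f);        // True: 1.0 is exactly representable
        Console.WriteLine((double)read[1] == 0.1); // False: 0.1 rounds differently as float
        Console.WriteLine((double)read[2]);        // 123456792: only about 7 digits survive
    }
}
```

If the full precision of the stored data matters, request the stored element type (here double) for T.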

## Partial I/O - Reading HDF5 Datasets with Hyperslabs

If a range parameter is provided to H5Dataset.Get<T>(), it is used to define the part of the dataset to retrieve. All range specifications valid for regular ILNumerics arrays are allowed for range. This includes any combination of ranged dimension specifiers, stepped ranges (increasing or decreasing), combined dimension specifiers and sequential indices. If a range contains a single numeric array of arbitrary shape, its elements are interpreted as sequential indices into the dataset. Read all details about subarray addressing options in the ILNumerics Subarray Tutorial.

Some examples of partial dataset retrievals:

ILNumerics supports the hyperslab feature of HDF5. Hyperslabs allow the partial retrieval of sections from the dataset. The ranges given for the subarray feature of ILNumerics work in a similar manner. However, not every range can be translated to an HDF5 hyperslab. ILNumerics therefore converts the range to hyperslabs whenever possible and falls back to a slower retrieval method when the translation is not possible.
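
For orientation: an HDF5 hyperslab selection describes a regular pattern in each dimension through a start offset, a stride and a count. The following plain C# sketch shows how a stepped range maps to such a triple. It does not call ILNumerics or HDF5, and `HyperslabSketch` and `FromRange` are hypothetical names. It assumes that ranges are written start:step:end with an inclusive end, as the range strings in this document suggest. How ILNumerics performs the translation internally is not described here.

```csharp
using System;

public static class HyperslabSketch
{
    // Translates an inclusive stepped range start:step:end into the
    // (start, stride, count) triple of a one-dimensional hyperslab.
    // Returns null when no such triple exists (downward or empty ranges).
    public static (long Start, long Stride, long Count)? FromRange(long start, long step, long end)
    {
        if (start < 0 || step <= 0 || end < start)
            return null;
        long count = (end - start) / step + 1; // number of selected elements
        return (start, step, count);
    }

    public static void Main()
    {
        Console.WriteLine(FromRange(1, 1, 10));            // (1, 1, 10)
        Console.WriteLine(FromRange(0, 2, 9));             // (0, 2, 5)
        Console.WriteLine(FromRange(10, -2, 0).HasValue);  // False
    }
}
```

The last call illustrates why downward counting ranges need the fallback path: a hyperslab stride cannot be negative.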

See the performance note below for a discussion of which ranges allow the translation to hyperslabs.

## Writing Data to HDF5 Datasets

Writing to datasets works similarly to reading from them. H5Dataset.Set(A, range) takes the data elements to be written to the dataset in A. Optionally, a range specifies the area of the dataset to overwrite.

### Full Writes

If range is omitted, the whole dataset will be replaced by A. H5Dataset.Set(A) without providing a range parameter will not only replace the elements in the dataset at the beginning of each dimension but will also resize the dataset to match the size of A. At the same time, this is currently the only way to shrink an H5Dataset.

### Partial Writes

The range parameter in H5Dataset.Set(A,range) determines the area inside the dataset to which the data from A is written. The rules for range are similar to those for read access and for Array<T>[range] subarray expressions. However, here the range must be convertible to hyperslabs in order to be applicable. Defining a range inside the current dimensions of the dataset replaces the affected elements.

If the range defines any elements outside of the current dimension limits the dataset is enlarged just as much as needed for A to be stored. The expansion may allocate new elements in the dataset which are not addressed by range - hence not replaced with values of A. Those elements will be set to the fill value configured for the dataset:

Here, a fill value of -99 was configured for the dataset in the constructor. The dataset has a size of 2x3 elements. By setting the values in the 3rd and 4th row, 4th and 5th column the dataset is expanded to a size of 4x5 elements. All elements newly allocated will get the fill value assigned as configured for the dataset. The content of A is then used to replace the elements addressed by range.
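
The expansion described above (2x3 dataset, fill value -99, write to rows 3 and 4, columns 4 and 5) can be modelled in plain C# as follows. This is only an in-memory model of the documented behavior. It does not call the ILNumerics API, and `FillOnExpand` and `SetWithExpand` are hypothetical names. Offsets are zero-based, and the values stored in the 2x3 dataset and in A are made up for illustration.

```csharp
using System;
using System.Globalization;

public static class FillOnExpand
{
    // Writes 'a' into 'data' at the given zero-based row/column offset.
    // If the target area exceeds the current extents, a larger array is
    // allocated and every newly created element receives fillValue.
    public static double[,] SetWithExpand(double[,] data, double[,] a,
                                          int rowOffset, int colOffset, double fillValue)
    {
        int oldRows = data.GetLength(0), oldCols = data.GetLength(1);
        int rows = Math.Max(oldRows, rowOffset + a.GetLength(0));
        int cols = Math.Max(oldCols, colOffset + a.GetLength(1));
        var result = new double[rows, cols];

        // Keep existing elements, fill everything that is new.
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                result[r, c] = (r < oldRows && c < oldCols) ? data[r, c] : fillValue;

        // Replace the addressed area with the content of 'a'.
        for (int r = 0; r < a.GetLength(0); r++)
            for (int c = 0; c < a.GetLength(1); c++)
                result[rowOffset + r, colOffset + c] = a[r, c];
        return result;
    }

    public static void Main()
    {
        var data = new double[2, 3] { { 1, 1, 1 }, { 1, 1, 1 } };
        var a = new double[2, 2] { { 5, 6 }, { 7, 8 } };
        double[,] grown = SetWithExpand(data, a, 2, 3, -99);

        for (int r = 0; r < grown.GetLength(0); r++)
        {
            var row = new string[grown.GetLength(1)];
            for (int c = 0; c < row.Length; c++)
                row[c] = grown[r, c].ToString(CultureInfo.InvariantCulture);
            Console.WriteLine(string.Join(" ", row));
        }
        // 1 1 1 -99 -99
        // 1 1 1 -99 -99
        // -99 -99 -99 5 6
        // -99 -99 -99 7 8
    }
}
```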

Note that no type parameter is required for H5Dataset.Set(A,range). A may be of any element type convertible to the actual element type in the dataset. HDF5 takes care of those conversions for us.

## A Note on Performance. Hyperslabs in ILNumerics®

ILNumerics prefers to use HDF5 hyperslab selections to retrieve and store subarrays from and to datasets. This allows parts of the dataset to be read or written without accessing the whole dataset. Using hyperslab selections on large datasets can drastically reduce the memory requirements and thereby improve the performance of the transfer.

No action is necessary on the user's side in order to use the hyperslab feature of HDF5. ILNumerics automatically translates the subarray specification given by range into corresponding hyperslab selections. However, since the way hyperslabs work is not 100% compatible with ILNumerics' subarray features, the conversion and the highly efficient subarray access are only possible for a certain subset of range specifications. The following requirements must be met in order for a range definition to be compatible with hyperslabs:

• If a range is given, it must address the same number of dimensions as inside the file. I.e., if the dataset in the file is a 3 dimensional dataset, range must address all three dimensions. Omitting any dimension (as is possible for trailing dimensions with ILNumerics arrays) will lead to a less efficient subarray read access using Get<T>() without using hyperslabs.
• There must be exactly one dimension specifier for each dimension in range. Specifying multiple ranges for a dimension will lead to a less efficient access.
• The single dimension specifier must be a simple range specification. All ranges must be upwards counting. Downward counting ranges are only retrieved without using hyperslabs.
• If individual indices are used as dimension specifiers, those indices must resolve to a selection which would also be achievable via a simple regular (stepped) range.
• For writing to datasets the new value (right side) must be broadcastable to the area addressed by the range specification.
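
The requirement on individual indices can be checked mechanically. The following plain C# sketch tests whether a list of indices could equally be written as a simple upward stepped range. It does not call the ILNumerics API, `RegularIndexCheck` and `IsRegularUpward` are hypothetical names, and the actual check inside ILNumerics may differ.

```csharp
using System;

public static class RegularIndexCheck
{
    // Returns true when the indices form an upward counting sequence with a
    // constant positive step, i.e. when the same selection could be written
    // as a stepped range. 'start' and 'step' are meaningful only on success.
    public static bool IsRegularUpward(long[] indices, out long start, out long step)
    {
        start = 0;
        step = 1;
        if (indices == null || indices.Length == 0) return false;

        start = indices[0];
        if (start < 0) return false;
        if (indices.Length == 1) return true; // a single index is a range of length one

        step = indices[1] - indices[0];
        if (step <= 0) return false;          // repeated or downward counting indices

        for (int i = 2; i < indices.Length; i++)
            if (indices[i] - indices[i - 1] != step) return false;
        return true;
    }

    public static void Main()
    {
        Console.WriteLine(IsRegularUpward(new long[] { 1, 2, 3 }, out _, out _));  // True
        Console.WriteLine(IsRegularUpward(new long[] { 1, 2, 4 }, out _, out _));  // False
        Console.WriteLine(IsRegularUpward(new long[] { 10, 8, 6 }, out _, out _)); // False
    }
}
```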

The following table gives some examples of subarray access on an HDF5 dataset ds of size [20x10] and classifies them as hyperslab-enabled (efficient) or hyperslab-disabled access.

Examples for dataset subarray access with and without HDF5 hyperslabs

| Subarray | Using hyperslabs? | Why not? |
| --- | --- | --- |
| `ds.Get<double>("1;3");` or `ds.Get<double>(1,3);` | yes | -- |
| `ds.Get<double>("1;1:3");` or `ds.Get<double>(1,r(1,3));` | yes | -- |
| `ds.Get<double>("1,2,3;1:10");` or `ds.Get<double>(cell(1,2,3),r(1,10));` | yes | (first dimension is defined explicitly but still resolves to a simple regular range) |
| `ds.Get<double>("1,2,4;1:10");` or `ds.Get<double>(cell(1,2,4),r(1,10));` | no | range for 1st dimension is not regular |
| `ds.Get<double>("1:5,0:5;0:end");` or `ds.Get<double>(cell(r(1,5),r(0,5)),r(0,end));` | no | more than one dimension specifier for 1st dimension |
| `ds.Get<double>("200");` or `ds.Get<double>(200);` | no | only one dimension addressed for two dimensional dataset |
| `ds.Get<double>("1:end");` or `ds.Get<double>(r(1,end));` | no | only one dimension addressed for two dimensional dataset |
| `ds.Get<double>("10:-2:0;2");` or `ds.Get<double>(r(10,-2,0),2);` | no | downward counting range |
| `ds.Get<double>(A);` | ? | depends on content of A |

## Supported Datatypes

The following element datatypes are currently supported for HDF5 datasets and attributes by ILNumerics. By 'support' we mean: you can write data from Array<T> with the following element types T into HDF5 attributes and datasets. The table also displays the corresponding H5T_NATIVE element type mapping:

| ILNumerics (.NET) Datatype | HDF5 Datatype |
| --- | --- |
| System.Byte | H5T_NATIVE_UINT8 |
| System.SByte | H5T_NATIVE_INT8 |
| System.Int16 | H5T_NATIVE_INT16 |
| System.UInt16 | H5T_NATIVE_UINT16 |
| System.Int32 | H5T_NATIVE_INT32 |
| System.UInt32 | H5T_NATIVE_UINT32 |
| System.Int64 | H5T_NATIVE_INT64 |
| System.UInt64 | H5T_NATIVE_UINT64 |
| System.Char | H5T_NATIVE_UINT16 |
| System.Single | H5T_NATIVE_FLOAT |
| System.Double | H5T_NATIVE_DOUBLE |
| System.Boolean | H5T_NATIVE_INT8 |

Reading existing datasets is not limited by this list! The underlying HDF5 library manages all datatype conversions, even for element datatypes which are not in this list. As one example, you can read any little or big endian element type from an existing H5Dataset into Array<int> and will get correct results.
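
For code that needs to check up front whether an element type can be written, the table above can be kept as a simple lookup. The following plain C# sketch copies the table into a dictionary. `Hdf5TypeNames`, `Map` and `TryGetNativeName` are hypothetical names and not part of the ILNumerics API. The values are only the names of the HDF5 native types as strings.

```csharp
using System;
using System.Collections.Generic;

public static class Hdf5TypeNames
{
    // .NET element type -> name of the HDF5 native datatype (copied from the table).
    public static readonly IReadOnlyDictionary<Type, string> Map = new Dictionary<Type, string>
    {
        { typeof(byte),   "H5T_NATIVE_UINT8"  },
        { typeof(sbyte),  "H5T_NATIVE_INT8"   },
        { typeof(short),  "H5T_NATIVE_INT16"  },
        { typeof(ushort), "H5T_NATIVE_UINT16" },
        { typeof(int),    "H5T_NATIVE_INT32"  },
        { typeof(uint),   "H5T_NATIVE_UINT32" },
        { typeof(long),   "H5T_NATIVE_INT64"  },
        { typeof(ulong),  "H5T_NATIVE_UINT64" },
        { typeof(char),   "H5T_NATIVE_UINT16" },
        { typeof(float),  "H5T_NATIVE_FLOAT"  },
        { typeof(double), "H5T_NATIVE_DOUBLE" },
        { typeof(bool),   "H5T_NATIVE_INT8"   },
    };

    // True when the element type appears in the table of writable types.
    public static bool TryGetNativeName(Type elementType, out string name)
        => Map.TryGetValue(elementType, out name);

    public static void Main()
    {
        Console.WriteLine(Map[typeof(char)]);                         // H5T_NATIVE_UINT16
        Console.WriteLine(TryGetNativeName(typeof(decimal), out _));  // False
    }
}
```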