String Handling in HDF5
While the ASCII character set is still surprisingly widespread in industrial applications, unicode string encoding is on the rise and today a de facto standard in many respects. In .NET there is only one string data type: System.String. It stores its data in memory as a variable length sequence of UTF16 code units of 2 bytes each; characters outside the Basic Multilingual Plane occupy two such units (a surrogate pair). This makes it simple to store any known unicode character in such strings and to handle them transparently in very convenient ways.
However, what is easy on the .NET side turns out to be more complex in the HDF world. In HDF strings are represented (stored) in a variety of ways. They differ with respect to the storage format, the character encoding and other aspects. While in most situations users of ILNumerics.IO.HDF5 won't have to deal with those subtleties at all, our API allows you to control every aspect of how strings are written to and read from HDF5 files. This article explains how.
The following topics are covered:
- Where are strings used in HDF5?
- Default string settings in HDF5 vs. ILNumerics.
- Handling strings in object names and links.
- Strings as dataset elements.
- Creating string datasets
- Controlling the length, encoding and padding.
- Strings as attribute elements.
Where are Strings used in HDF5?
Strings are used in the following places in an HDF5 file:
- Link names (HDF5 paths to any object, hard links and soft links)
- Attribute names
- Attribute or dataset elements
- File names (handled by the OS, limited by HDF5)
- Field names of compound datatypes (soon to come)
- Error messages (all ASCII, currently)
ILNumerics lets you use any .NET (unicode) string of arbitrary length in these places without any specific action. 'Use' includes writing, reading (roundtrip) and modifying existing strings. Some places in the list above are not, or not yet, served explicitly by ILNumerics: file names are handled by the OS and are subject to certain limitations introduced by HDF5 itself. Error messages are simple enough that they don't require special handling (however, such messages are provided to the user much more conveniently than with the low-level HDF5 API).
This leaves to ILNumerics the handling of strings for object names and as elements in attributes and datasets.
Default String Handling
If you are using ILNumerics to interface with HDF5 files there will be almost no need to think about any storage format in the file. However, when your data must be read by external programs or other APIs it is important to understand the way strings are stored internally in HDF5 files. This will help you to make sure that your data will be readable without loss of information later. The same is true for reading data files which were created externally, of course.
ILNumerics' default setting for strings deviates slightly from the one in HDF5:
Common .NET System.String:
UTF16, variable length
ASCII / UTF8
String Attribute Elements
- ILNumerics: variable length / fixed length, 0-terminated / space padded, ASCII / UTF8
- HDF5: variable / fixed length, 0-term / space padding*

String Dataset Elements
- ILNumerics: variable length / fixed length, 0-terminated / space padded, ASCII / UTF8
- HDF5: variable / fixed length, 0-term / space padding*
The overview above lists the options for storing strings, both when using the ILNumerics API and when using the low-level HDF5 C-API directly. ILNumerics defaults to variable length, UTF8 encoded strings, while HDF5 itself defaults to ASCII encoding. Note how the setting (*) for the padding in the low-level HDF5 API depends on the language used: the C-API creates 0-terminated strings by default, whereas the FORTRAN API creates space-padded strings instead.
It is recommended to make yourself familiar with the differences in representing strings in the various APIs. When working with ILNumerics you will always handle strings as System.String (string in C#, String in Visual Basic). In order to enable roundtrips when working with HDF5 files, strings in ILNumerics are stored as variable length strings, UTF8 encoded by default. This string type most closely matches the natural representation of System.String. However, ILNumerics lets you configure this storage format to match the behavior of any other API exactly. Read further below for more details on this topic.
When reading strings from HDF5 files the ILNumerics API will always return all strings as System.String, regardless of the actual storage format in the HDF5 file. So whether you are referring to an object name which contains Japanese characters and is stored as UTF8 bytes in the file, or you are reading a dataset of fixed length ASCII strings which were written in FORTRAN and contain trailing space characters: you will handle them as regular .NET strings, such as "数学は楽しいです". Dataset values are returned as Array<string> accordingly.
Strings in Names for Objects and Links
Any unicode character is valid in names of links to objects in HDF5 files.
Any name referencing an object can simply use arbitrary unicode characters. The ILNumerics API handles all necessary encoding in both directions: reading and writing. The same holds for the tooltips in Visual Studio, which display such names as well.
Controlling the Encodings in HDF5 names
Since ILNumerics allows unicode names out of the box it will mark all names in the HDF5 file as encoded in UTF8. For compatibility with legacy systems it may be necessary to store names as ASCII encoded bytes only. ILNumerics supports this with the help of a (global) flag:
By default ILNumerics.Settings.HDF5DefaultStringEncoding = StringEncoding.UTF8. When this is set to StringEncoding.ASCII any new object created by ILNumerics will expect only ASCII characters in the provided name and mark the name as ASCII encoded in the HDF5 file.
A note about ASCII / UTF8 compatibility
As you may know there is a certain property of UTF8 encoding which helps compatibility with legacy systems significantly: UTF8 was designed in a way that all ASCII characters are a subset of the UTF8 character space. This means that regular ASCII characters are not affected by the UTF8 encoding. Encoding or decoding any character from the ASCII set yields the same bytes as if no encoding had been applied.
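The subset property can be checked at the byte level. The following is a minimal Python sketch using only the standard library; it does not use the ILNumerics API, and the sample strings are made up for illustration.

```python
# Pure-ASCII text produces identical bytes under ASCII and UTF8 encoding.
ascii_name = "temperature_01"  # hypothetical object name
ascii_bytes = ascii_name.encode("ascii")
utf8_bytes = ascii_name.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes

# A character outside the ASCII set needs more than one byte in UTF8.
unicode_name = "数学"
unicode_bytes = unicode_name.encode("utf-8")

print(same_bytes)                             # identical byte sequences?
print(len(unicode_name), len(unicode_bytes))  # characters vs. bytes
```

A reader which knows ASCII only can therefore decode `utf8_bytes` without noticing that a UTF8 encoder was involved, while `unicode_bytes` contains bytes above 127 which such a reader cannot interpret.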
This property comes in especially handy for compatibility with older systems: consider creating an HDF5 file which is to be read by a legacy system B. B is old enough to handle ASCII characters only. As long as you provide only ASCII characters as the names for your HDF5 objects the names will come out the same, regardless of whether they are 'decoded' using UTF8 or not! In ILNumerics, again, you will not need to think about which encoding to use. When leaving ILNumerics.Settings.HDF5DefaultStringEncoding = StringEncoding.UTF8 (default) the names will be encoded, but since only ASCII characters are contained the result will still fit into the ASCII set.
However, there exists one reason to configure ILNumerics.Settings.HDF5DefaultStringEncoding properly here: when it is set to StringEncoding.ASCII ILNumerics will check all names provided and will throw an error whenever a non-ASCII character is found in a name.
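The kind of check described here can be sketched in a few lines. The helper below is hypothetical and written in Python for illustration only; it is not the ILNumerics implementation, and the exception type and message are assumptions.

```python
def check_ascii_name(name: str) -> str:
    """Return the name unchanged if it is pure ASCII, else raise ValueError.

    Hypothetical helper: mimics a strict 'ASCII only' policy for object names.
    """
    for position, char in enumerate(name):
        if ord(char) > 127:  # code points above 127 are outside the ASCII set
            raise ValueError(
                f"non-ASCII character {char!r} at position {position} in {name!r}"
            )
    return name


print(check_ascii_name("group_a"))  # accepted: ASCII characters only
try:
    check_ascii_name("Straße")      # rejected: 'ß' is not an ASCII character
except ValueError as error:
    print("rejected:", error)
```

Failing early at the time of naming is preferable to writing a name which a legacy reader later misinterprets.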
Determining the Encoding Configured for an Object's Name
The character encoding which was configured when an object's link name was created can be determined by inspecting the NameEncoding property of the HDF5 object.
The setting is stored with the object at the time an object is created and cannot change afterwards. Therefore, the NameEncoding property is readonly.
Strings as Dataset Elements
Being able to name objects in arbitrary ways is fine and nice. But now the true fun begins: storing an arbitrary number of arbitrary strings as values of datasets and attributes.
In this example we created two new datasets, having 4 string elements each. One forms a vector of strings. The other creates a small matrix of two rows with two string elements each.
We have created these datasets in the same way as we would do with regular datasets. Since we didn't provide any special configuration the new string datasets are created with the default configuration:
Default String Element Configuration
- Variable Length
- UTF8 encoding
The default settings are used whenever a new dataset is created by: new H5Dataset().
Controlling String Properties
In ILNumerics string datasets are represented by the class H5StringDataset. This class offers a collection of properties to control the configuration of string elements and to read such configuration back from the file. If we want to configure the way string datasets are created we create the new dataset as an instance of H5StringDataset explicitly.
This way we can make use of additional constructor parameters of H5StringDataset:
- encoding - specify the character encoding for all string elements. UTF8 (default) or ASCII.
- length - specify the number of bytes reserved for each element (fixed length strings) or that variable length is to be used (-1, default).
- padding - for fixed length strings the padding determines how to handle the unused bytes when a string is shorter than defined by 'length'. The default is to 0-terminate the strings.
See the online API reference for more information.
Note that any such string parameters can be controlled at creation of the dataset only. After the dataset exists those parameters cannot be changed anymore. The only way to change a parameter after creation is to delete the dataset and to recreate a new one with the new parameters.
Variable Length vs. Fixed Length Strings
By default each string element of H5StringDataset stores only the number of characters (bytes) needed to encode the current string value. Since the string values can differ between individual cells of the dataset this storage scheme is referred to as 'variable length' strings. Variable length storage is the default in ILNumerics and is specified by providing a value of '-1' to the constructor of H5StringDataset. This scheme is efficient in terms of file space since only the required number of bytes is used. However, when exchanging string data with other applications one must make sure that the external application is able to read back such variable length strings.
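To get a feeling for the space argument, the following Python sketch compares the UTF8 payload of a few sample strings with the space a fixed length layout would reserve. It only counts string bytes; any bookkeeping overhead which HDF5 adds for variable length storage is deliberately ignored, and the sample values are made up.

```python
elements = ["one", "Straße", "数学は楽しいです"]

# Bytes needed per element after UTF8 encoding (what variable length stores).
byte_counts = [len(text.encode("utf-8")) for text in elements]

# Variable length: each element takes only what it needs.
variable_payload = sum(byte_counts)

# Fixed length: every element reserves as much as the longest one.
fixed_payload = max(byte_counts) * len(elements)

print(byte_counts)
print(variable_payload, fixed_payload)
```

The gap between the two numbers grows with the spread of the element lengths, which is why fixed length storage pays off mainly for strings of similar size.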
Read more about the internal storage format of variable length strings in the official HDF documentation.
Another storage scheme is supported for easier compatibility with external applications: fixed length strings define a fixed storage space for each individual string element. Since each element uses the same space in the file this scheme may lead to wasted space in the HDF5 file as soon as many string elements are shorter than the fixed length. Conversely, in order to store the full information of all strings one must carefully choose the length for all dataset cells: it is defined by the element with the highest storage requirement. Read below how this is painlessly done.
Fixed length storage is selected for new datasets with the 'length' parameter. Use any reasonable positive number to define the number of bytes used for each dataset element. For ASCII encoding the number of bytes simply equals the number of chars. For UTF8 encoding, however, the number of bytes may be significantly higher. A simple and fast way of computing a safe number of bytes for fixed length storage is given by the following formula:
length = max([numbers of chars in each element]) * 4.
However, the unicode code points which take up 4 bytes per char in UTF8 are rather infrequently used in western countries. A better way to get a more efficient (smaller) number is to let ILNumerics figure out the length for you.
A value of '-2' for length will make ILNumerics iterate over all elements provided, determine the required number of bytes after UTF8 encoding and take the highest number as the fixed length for the dataset elements.
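Both strategies can be compared in a short Python sketch. The two helper functions are hypothetical names chosen for this illustration and are not part of ILNumerics; note that Python's len() counts unicode code points, and that the sketch does not reserve an extra byte for a terminating zero.

```python
def worst_case_length(elements):
    """Upper bound: longest element (in code points) times 4 bytes."""
    return max((len(text) for text in elements), default=0) * 4


def exact_length(elements):
    """Highest number of bytes any element needs after UTF8 encoding."""
    return max((len(text.encode("utf-8")) for text in elements), default=0)


elements = ["one", "Straße", "数学は楽しいです"]
print(worst_case_length(elements))  # safe, but usually larger than needed
print(exact_length(elements))       # tight bound for these elements
```

The exact bound is only valid for the elements it was computed from. Strings written to the dataset later may need more bytes than that.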
The value stored for the capacity of byte storage can be retrieved from an existing H5StringDataset by reading its Length property. Fixed length datasets will return a positive number. For variable length string datasets -1 is returned. Note that this property is readonly.
Since string datasets are common datasets after all, they can be extended and their values can be overwritten. Keep this in mind when creating the dataset! If you choose fixed length storage the selected length will stay the same over the lifetime of the dataset. Providing any string which does not fit into the selected length will lead to truncation and possibly even to corrupted string content.
While HDF5 itself uses ASCII encoding by default, ILNumerics uses UTF8 for encoding the string values of all dataset elements. For the reasoning refer back to the encoding of object names. The encoding parameter determines the character set for the string elements. There should rarely be a need to set this parameter to anything other than StringEncoding.UTF8, except for compatibility with external applications which might be able to read ASCII encoded strings only. In this case you may use StringEncoding.ASCII.
Note that the encoding parameter only affects the corresponding flag as stored with the dataset in the HDF5 file. The actual sequence of bytes ending up in the file will still be sent through a UTF8 encoder, regardless of the encoding parameter value. The user is responsible for not storing any characters in the dataset which do not fit into the ASCII character space. Note further that the result of UTF8 encoding will be indistinguishable from plain ASCII encoding as long as a string contains only ASCII characters.
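Why truncation can corrupt a string, rather than merely shorten it, becomes visible at the byte level: a cut may fall into the middle of a multi-byte UTF8 sequence. The Python sketch below illustrates the effect with a hypothetical helper; how ILNumerics or HDF5 actually truncate is not shown here.

```python
def truncate_to_fixed_length(text, length):
    """Encode as UTF8 and cut after 'length' bytes, ignoring char boundaries."""
    return text.encode("utf-8")[:length]


# 'ß' takes 2 bytes in UTF8; cutting after 5 bytes splits it in half.
raw = truncate_to_fixed_length("Straße", 5)
print(raw)

try:
    raw.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False  # the trailing byte is an incomplete UTF8 sequence
print(decodable)
```

Checking the encoded byte count against the dataset's length before writing avoids this situation.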
Padding for Fixed Length Strings
When the strings are stored with a fixed length there needs to be a way to handle the following situation:
Assume the length parameter was set to 10, so that a space of 10 bytes is reserved for each string element in the dataset. Now we decide to store a string such as "one". Since all 3 characters exist in the ASCII set, the resulting string stored in the file will be 3 bytes long. However, the reserved space for the string (10 bytes) exists nevertheless, whether we need it or not. What should we do with the remaining 7 bytes?
Three options exist:
- Fill the remaining 7 bytes with spaces (" " or ASCII code 32). This is referred to as StringPadding.SPACEPAD.
- Terminate the string with a '0' (zero byte) at position 4 and forget about the content of the remaining bytes. This is called StringPadding.NULLTERM.
- Fill the remaining bytes 4..10 with '0'. Let's call this StringPadding.ZEROPAD.
The default value is StringPadding.NULLTERM. Again, for compatibility with other legacy applications you may want to change this value. StringPadding.SPACEPAD is a setting which would be used by FORTRAN APIs, since in FORTRAN strings commonly appear to be fixed length, space padded strings. Keep in mind that the padding parameter is ignored for variable length strings, which are always 0-terminated.
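The three padding options can be made concrete with a byte-level Python sketch. The functions and the string constants below are hypothetical stand-ins for the padding modes and not ILNumerics code. The filler written after the terminator in the NULLTERM case is arbitrary, and reserving one byte for the terminator is an assumption of this sketch.

```python
def pad_fixed(text, length, padding):
    """Return 'text' as exactly 'length' bytes using the given padding mode."""
    data = text.encode("utf-8")
    if padding == "SPACEPAD" and len(data) <= length:
        return data.ljust(length, b" ")
    if padding == "ZEROPAD" and len(data) <= length:
        return data.ljust(length, b"\x00")
    if padding == "NULLTERM" and len(data) < length:
        # Bytes after the terminator carry no meaning; '#' marks them here.
        return (data + b"\x00").ljust(length, b"#")
    raise ValueError("string does not fit or unknown padding mode")


def unpad_fixed(data, padding):
    """Recover the string from its fixed length representation."""
    if padding == "SPACEPAD":
        data = data.rstrip(b" ")
    elif padding == "ZEROPAD":
        data = data.rstrip(b"\x00")
    elif padding == "NULLTERM":
        data = data.split(b"\x00", 1)[0]  # stop at the first zero byte
    else:
        raise ValueError("unknown padding mode")
    return data.decode("utf-8")


for mode in ("SPACEPAD", "NULLTERM", "ZEROPAD"):
    stored = pad_fixed("one", 10, mode)
    print(mode, stored, unpad_fixed(stored, mode))
```

One consequence of space padding in this sketch is that trailing spaces which belong to the original string cannot be told apart from the padding when reading back.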
It must be noted that all three string configuration parameters only affect the final strings as they are stored in the HDF5 file. On the ILNumerics or .NET side you will always provide Array<string> objects. The elements stored in these array objects will mostly be regular .NET strings, i.e. variable length UTF16 strings. However, depending on your application's needs they may already match the settings provided in length, encoding, and/or padding. As one example, it is fine to provide strings which all have the same length and are padded with spaces at their ends. They can be stored in exactly the same way as a FORTRAN application would expect them, by using the corresponding values for length, encoding, and/or padding.
Locating String Datasets
When iterating over the children of a group or when locating datasets by help of one of the group filtering methods ILNumerics will return objects of the type H5StringDataset for datasets storing strings. The following example demonstrates several ways of accessing the same string dataset:
Since the returned objects are instances of H5StringDataset, it is easy to cast them to this derived type. This gives access to the extended string properties and hence to the settings which are stored for the dataset. It also allows modifying existing dataset values by using methods of H5StringDataset, which are better adjusted to the special needs of string elements. Read on for details!
Modifying Existing String Datasets
String datasets are regular datasets. The values of a string dataset can be altered using the same methods as for regular, numeric datasets. According to the initial setting of maxdims the dimensions can be extended and new strings can be stored into the newly extended dataset cells.
Here we show a full roundtrip, including the modification of an existing dataset. The dataset is created with a single string element. Afterwards, a vector of strings is used to modify the extent of the dataset and to store the strings into the new dataset cells. A check is made that the values of the dataset now exactly match the original string array.
Retrieving Data from String Datasets
When reading data from regular datasets using the generic functions Get<T>() and Get<T>(params ILBaseArray range), the user must explicitly provide the element type T for the result. Data from string datasets are read in a very similar way. This works with both H5Dataset and H5StringDataset.
The Get<T>() function is called on an object of the dataset base class (H5Dataset). As expected the data of the dataset are returned as Array<string>. The same can be achieved by help of the derived, specific H5StringDataset objects:
This time we called Get() on the derived class H5StringDataset and got the same data. Note that this time the specification of the return data element type is not obligatory. Since there are string data stored in the dataset the only way to retrieve them is as strings. Hence we can omit the generic type parameter here.
The full spectrum of partial I/O features is available for string datasets just as for regular datasets. Refer to the dataset section for details.
Strings as Attribute Elements
Since HDF5 attributes are very similar to datasets the way strings are stored into attributes is very similar too. Basically, string attributes work pretty much like datasets - with the limitation that no partial I/O is available. String attributes are designed to store rather small arrays of strings. But this is not a fixed restriction.
The very same string configuration options exist for attributes as for datasets. Again, most of the time you will create string attributes in the same way as you would with numeric attributes: by providing a string value as the data. When non-default storage schemes are needed or properties of existing attributes need to be queried the H5StringAttribute class provides these options.
The next example demonstrates the creation of string attributes in various ways - using the short C# object initializer syntax.
String attributes are also directly supported by Visual Studio tooltips when inspecting HDF5 objects with ILNumerics in debug mode.