ILNumerics Ultimate VS

H5StringDataset Constructor

Create a new dataset, storing an array of strings.

[ILNumerics HDF5 Module]

Namespace:  ILNumerics.IO.HDF5
Assembly:  ILNumerics.IO.HDF5 (in ILNumerics.IO.HDF5.dll) Version: 5.5.0.0 (5.5.7503.3146)
Syntax

public H5StringDataset(
	string name,
	InArray<string> data,
	BaseArray maxdims = null,
	StringEncoding encoding = StringEncoding.UTF8,
	Nullable<int> baseType = null,
	int length = -1,
	Nullable<StringPadding> padding = null
)

Parameters

name
Type: System.String
The name of the new dataset.
data
Type: ILNumerics.InArray<String>
String array to be stored in the new dataset.
maxdims (Optional)
Type: ILNumerics.BaseArray
[Optional] The maximum length for each dimension in data. Default: null (unlimited dimension length).
encoding (Optional)
Type: ILNumerics.IO.HDF5.StringEncoding
[Optional] The encoding used to store the strings in the dataset elements. Default: UTF8.
baseType (Optional)
Type: System.Nullable<Int32>
[Optional] The base type of the strings as stored in the HDF5 file. Default: null (derived from vlen string class).
length (Optional)
Type: System.Int32
[Optional] Determines whether the string elements are stored with fixed or variable length. Default: -1 (variable length).
padding (Optional)
Type: System.Nullable<StringPadding>
[Optional] The padding used for fixed length strings (i.e. when length is -2 or an explicit byte count). Default: null (depends on baseType).
Remarks

By default, new string datasets in ILNumerics.IO.HDF5 are created with variable length strings, similar to C strings with UTF8 encoding. This kind of string corresponds most closely to the common String type in .NET.

The optional parameters encoding, baseType, length, and padding modify these defaults. Use them if the string elements must be stored as fixed length strings, if ASCII encoding must be used, or if more control over the fixed length settings (length, padding) is needed. This typically becomes necessary when exchanging data with existing unmanaged applications, other frameworks, or other APIs.
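As a hedged illustration of these optional parameters, the following sketch creates a fixed length, ASCII encoded, space padded dataset. It relies only on the constructor signature shown in the Syntax section. The helper name CreateFortranStyle, the dataset name "labels", and the enum member spellings StringEncoding.ASCII and StringPadding.SPACEPAD are assumptions inferred from this page. The code needs a reference to the ILNumerics HDF5 assembly and has not been checked against a specific release.

```csharp
using ILNumerics;
using ILNumerics.IO.HDF5;

static class StringDatasetSketch
{
    // Hypothetical helper: calls the documented constructor with non-default settings.
    public static H5StringDataset CreateFortranStyle(InArray<string> data)
    {
        return new H5StringDataset(
            "labels",                          // dataset name (illustrative)
            data,                              // string elements to store
            encoding: StringEncoding.ASCII,    // store ASCII bytes instead of UTF8
            length: -2,                        // fixed length, byte count chosen by ILNumerics
            padding: StringPadding.SPACEPAD);  // fill shorter elements with spaces
    }
}
```

The padding is set explicitly here because baseType is left at its default, for which the default padding would otherwise be NULLTERM.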

name is used to identify the new dataset in the collection of datasets of the hosting object. The name can be any string, including special characters from the whole Unicode character space. If any character in name is outside the ASCII character set, make sure that HDF5DefaultStringEncoding is set to its default value of UTF8.

data is an n-dimensional ILNumerics array of arbitrary dimensionality and size with string elements of arbitrary lengths. The size of data determines the initial size of the dataset chunks in the file. However, one can use the H5Dataset.Set(BaseArray, BaseArray[]) function on existing datasets to change their size afterwards. Selecting a reasonable size for data at creation helps to optimize the performance of later reads and writes, because the chunk size is then tuned to your specific data size.

The encoding parameter specifies the content encoding of the dataset's string elements. The default value of UTF8 allows storing any Unicode character and making a full roundtrip by reading it back into ILNumerics string arrays.

If, for some (rare) reason, the elements must be stored as ASCII encoded bytes, the encoding parameter can be set to ASCII.

Note that HDF5 itself uses ASCII as its default encoding. However, since ASCII is a subset of UTF8, any ASCII string keeps exactly the same bytes when it is UTF8 encoded. Hence, no compatibility issues are expected when using the default encoding (UTF8) in ILNumerics.IO.HDF5 with ASCII strings: in the file both versions are byte-compatible. A program that is only aware of ASCII encoding will read such strings exactly as if they had been stored with ASCII encoding explicitly.
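The byte compatibility of ASCII and UTF8 for pure ASCII text can be checked with the standard .NET encoders alone. The following is a minimal sketch that does not touch HDF5 or ILNumerics; the class and method names are illustrative.

```csharp
using System;
using System.Linq;
using System.Text;

static class EncodingCompat
{
    // True when the ASCII and UTF8 encodings of s are byte-for-byte identical.
    public static bool SameBytes(string s) =>
        Encoding.ASCII.GetBytes(s).SequenceEqual(Encoding.UTF8.GetBytes(s));

    static void Main()
    {
        Console.WriteLine(SameBytes("temperature"));      // True: pure ASCII text
        Console.WriteLine(SameBytes("Gr\u00F6\u00DFe"));  // False: contains non-ASCII characters
    }
}
```

The second call returns False because the .NET ASCII encoder replaces each non-ASCII character with a single '?' byte, while UTF8 uses two bytes for each of them.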

baseType can optionally be used to specify the HDF5 base type for storing the string elements on disk. Two string base types are common here: C_S1 (default) and FORTRAN_S1. While the former most closely corresponds to common variable length, null-terminated C strings, the latter mimics strings as stored by FORTRAN (fixed length, space padded). Any other predefined datatype may be used as the source datatype here. Make sure that the base type corresponds to a class of STRING.

The length parameter controls the length of individual elements as they are stored in the HDF5 file:

  • -2
    - elements are stored as fixed length strings in the file, i.e. the number of bytes used is the same for all elements. The actual number is determined automatically by ILNumerics according to the string elements provided in data and the settings of encoding and padding. Note that this parameter counts bytes after encoding the chars from data and after applying any 0-padding / termination to the encoded bytes. Therefore the resulting number may be larger than expected. With UTF8 encoding and non-ASCII characters it will be larger than the number of characters in data's elements. '-2' is the recommended value for storing fixed length strings.
  • -1
    - elements are stored as variable length strings, i.e. the number of bytes used for storing individual strings from data may differ. This is the default setting in ILNumerics. It uses the least storage when storing strings of varying lengths.
  • >= 0
    - a positive number specifies a fixed number of bytes used to store the strings from data. This corresponds to the fixed length strings created by -2, except that you are responsible for figuring out the optimal number of bytes to use. If this number is too large, you will waste storage in the resulting file. If the length is too small to fit all (encoded, null-terminated) strings converted to byte arrays, the strings will be truncated. It is recommended to use '-2' for length in the fixed length case and let ILNumerics determine the number of bytes required to store all characters without losing any information.
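To make the byte counting concrete, the sketch below estimates the smallest fixed length that holds every element after encoding, plus one terminating 0 byte. It only illustrates the arithmetic described above under the assumption of NULLTERM-style storage; it is not the algorithm ILNumerics uses internally, and the names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text;

static class FixedLengthSketch
{
    // Largest encoded size among the elements, plus one byte for a terminating 0.
    // Assumes a non-empty array.
    public static int RequiredBytes(string[] elements, Encoding encoding) =>
        elements.Max(s => encoding.GetByteCount(s)) + 1;

    static void Main()
    {
        var data = new[] { "abc", "Gr\u00F6\u00DFe", "x" };
        // "Größe" has 5 characters but 7 UTF8 bytes, so the result is 7 + 1.
        Console.WriteLine(RequiredBytes(data, Encoding.UTF8)); // 8
    }
}
```

A manually chosen length below this value would truncate the longest element, which is why letting the library pick the size with -2 is the safer choice.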

The padding parameter controls what is stored at the end of strings that are shorter than the fixed length given by length. Commonly this is set to NULLTERM (default), which terminates every string element with a 0-byte value. Other settings create strings in the same manner as, for example, FORTRAN would: SPACEPAD in combination with a fixed length creates space padded strings of the same length, without any 0-termination. This way it is possible to create any string dataset configuration and to ensure compatibility with external programs with limited capabilities.
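The difference between the two fill styles can be pictured with plain byte arrays. This sketch shows only the fill byte for short elements (0 versus space) and does not model how HDF5 or ILNumerics writes the data; in particular it does not force a terminating 0 when a string is truncated. All names are illustrative.

```csharp
using System;
using System.Text;

static class PaddingSketch
{
    // Encodes s as ASCII into exactly `length` bytes, filling the tail with `fill`.
    // Strings longer than `length` are truncated.
    public static byte[] Pad(string s, int length, byte fill)
    {
        byte[] encoded = Encoding.ASCII.GetBytes(s);
        byte[] result = new byte[length];
        for (int i = 0; i < length; i++)
            result[i] = i < encoded.Length ? encoded[i] : fill;
        return result;
    }

    static void Main()
    {
        Console.WriteLine(BitConverter.ToString(Pad("ab", 4, 0x00))); // 61-62-00-00
        Console.WriteLine(BitConverter.ToString(Pad("ab", 4, 0x20))); // 61-62-20-20
    }
}
```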

For element base types of C_S1 (default) the default padding is NULLTERM. For element base types of FORTRAN_S1 padding will be set to SPACEPAD by default. The default padding for other element types is undefined.

The padding parameter is not used for variable length strings (length = -1). Such strings are always stored with a padding of NULLTERM.

The constructor creates the dataset in memory only. Its actual creation in a file is delayed until the dataset is added to a group that is part of an HDF5 file.

Examples

// requires: using ILNumerics; using ILNumerics.IO.HDF5;
// fname: path of the HDF5 file
using (var f = new H5File(fname)) {
    // create a new dataset named "uniqueName", provide initial data and add it to the root group
    f.Add(new H5Dataset("uniqueName", ILMath.ones(10, 20)));
}

Datasets in ILNumerics are always created chunked! The chunk size is implicitly derived from the initial data used.
