Performance on ILArray

Having a convenient data structure like ILArray<T> brings many advantages when handling numerical data in your algorithms. On the convenience side, there are flexible options for creating subarrays, altering existing data (i.e. lengthening or shortening individual dimensions on the run), keeping dimensionality information together with the data, and last but not least: being able to formulate an algorithm by concentrating on the math rather than on loops and the like.

Convenience and Speed

Another advantage is performance: by writing C = A + B, with A and B being large arrays, the inner implementation is able to choose the most efficient way of evaluating this expression. Here is, what ILNumerics internally does:

  • Pointer artithmetics – we remove the obligatory bound checks on array accesses by using C# pointers.
  • Cache aware implementations – the biggest bottleneck in modern processors is memory, as your know. ILNumerics implements all basic math functions in a way which allows to very efficiently profit from cache hierarchies in your processor. You will notice, if you time ILMath.sum over the rows of an array against a naive implementation.
  • Loops over large arrays are unrolled, saving a great amount of bookkeeping overhead.
  • Multithreading support – iterations over large arrays are split and distributed to multiple cores.

ILNumerics Accelerator

Here is, what we are planning to add to this list:

  • Utilizing SIMD vector extensions (SSE, AVX). Today, this parallel potential is accessed via the (native) MKL in ILNumerics. There has been a recent update to MKL 11.2 in ILNumerics Ultimate VS 4.4 which brings AVX2 support on corresponding Intel processors. For the future, we are planning to bring in extendend support for all managed evaluations as well. This will be introduced with the release of our ILNumerics Accelerator. Stay tuned – this will be a very exciting release… !
  • The Accelerator will also do onother highly efficient feature: removal of intermediate arrays. Expressions like C = A + 2 * B will no longer create an intermediate result for
    2 * B, but reformulate the expression before evaluation. Again: memory is the bottleneck and this will save a lot of memory transfer to the CPU.
  • Having a vast amount of GPUs available in basically every desktop computer today makes one wanting to use that – especially when waiting for your results. ILNumerics Accelerator will do it! It will decide on its own during runtime, which part of your algorithms are best executed on which available hardware and do it in the most efficient way possible. Need a further speedup? Just buy a better graphics card!

Limitations of n-dim arrays

Nothing comes for free, right? All the above is great for large data on vectorized algorithms. But what if we have a tight loop over small ILArrays or an algorithm which basically does a vast amount of scalar computations?

The flexibility of ILArray certainly comes at a price here. It becomes clear if you imagine the difference between a single double value and its corresponding representation in terms of a scalar ILArray<double>. The former – while being a value type lives on the stack rather than on the managed heap. It is not involved in garbage collection and there is basically no cost involved for its construction, access and destruction – at least when compared to the full featured implementation of ILArray.

ILArray, on the other hand, is a class instance. It lives on the heap instead of the stack storage. Creating an instance corresponds to the creation with ‘new’. For the destruction ILNumerics is able to completely prevent the GC having to handle ILArray instances. But even this memory management comes with a cost, naturally.

So there are two sides of the medal: flexibility and better performance for ‘large’ data. Flexibility and slower speed for tiny … to scalar data. For most algorithms, having a reasonable amount of management of small data does not hurt performance. But if such operations are to be done in tight loops, one starts looking for alternatives.

Alternatives

Especially for small data the flexible subarray features of ILArray are often not needed anyways.  Such scalar algorithm parts are much more efficiently implemented by using data structures that are especially designed for it: System.ValueType and System.Array. The good news: ILNumerics makes it very easy to resort to such system types instead. Here comes a comprehensive list of options:

Scalar operations on an ILArray<T>

Scalar access Returns Efficiency Documentation
A[0,1] ILArray<T> Improved since 4.3, but will not be fastest on scalars (ILNumerics Accelerator might change this!) Subarrays
A.GetValue()
A.SetValue()
T faster and efficient if you only need a single element occasionally Writing to Arrays
foreach (var a in A) { … } T very efficient iteration over all elements of A, gives copies as system types ILArray and LINQ
A.GetArrayForRead()
A.GetArrayForWrite()
T[] direct access to the underlying system array of A. Use this for hand tuned inner kernels or if you need access to native libraries. Array Import / Export

You may also consider the following references:
Stackoverflow posts dealing with the scalar access issue:
http://stackoverflow.com/questions/20944672/iteration-through-ilarraydouble-is-very-slow

http://stackoverflow.com/questions/19224626/performance-of-ilnumerics-of-simple-math-operations-over-vectors-vs-system-array