ILNumerics - Technical Computing

ILNumerics Accelerator - Getting Started, Part I

In this tutorial we will get you going with ILNumerics Accelerator. We present some simple examples and speed measurements, and you will learn about important aspects of the inner workings of the Accelerator compiler.

Installation

No installation is required to work with ILNumerics Accelerator. It can be used on any licensed and activated ILNumerics developer seat. Note that ILNumerics Accelerator requires a dedicated module license to be added to your seat. You may use a regular trial license.

In the following we will use Visual Studio 2022 and C# 8.0. Other versions work as well but may require minor adjustments to the code.

First Accelerator Example

1. Create a new console application:

Open Visual Studio and create a new project: ConsoleApp1. Leave the defaults suggested by Visual Studio (a Console App targeting .NET 6.0, AnyCPU; most other configurations will do, too).

2. Import ILNumerics.Computing:

Right-click on the ConsoleApp1 node in Solution Explorer. Select "Manage NuGet Packages" to open the NuGet package manager. Find and install the ILNumerics.Computing package into the project.

3. Test the simple app:

Open Program.cs from the "Hello World" app created as a stub. Replace the content of the file with the following:

Hit F5 to start the app. The output should read something similar to this:

4. A simple accelerator expression:

Replace the content of the file with the following, more interesting array expression:

yielding ...

Congratulations! You have just summed 10,000 columns, each containing the numbers from 1 to 10,000. Instead of displaying the result, we compute this array twice and compare both results. Note how the second expression is surrounded by two comments: //ILN(enabled = false) and //ILN(enabled = true). Their meaning will become obvious soon.

5. Bring in the Accelerator

During the pre-release phase the ILNumerics Accelerator compiler must be added manually as a package dependency to your project. This allows you to switch back to your original code by simply removing the package again. Later, in the release version, the Accelerator will be a permanent dependency of ILNumerics Computing.

Use the NuGet package manager to locate the ILNumerics.Accelerator.Command package and add it to your project. Make sure to include pre-release versions when searching for the package during the beta phase! Confirm the license dialog if you agree to the conditions.

Hit F5 to build and start the app. The output is the same as before. What else did we expect ... !?

Observe the Build tab from the Visual Studio output tool window. It displays the following message:

ILNumerics Accelerator: no optimization was performed because the project setting 
'ILNAcceleratorEnabled' was: false. By default, Release builds are optimized. Debug builds are not.

ILNumerics Accelerator replaces code sections of array expressions with optimized versions thereof. While producing the same results, the resulting code will look and execute very differently. Such code can be confusing when debugging. Hence, by default the ILNumerics Accelerator compiler is enabled only for Release builds. This setting is configurable.

6. Switch to Release Configuration

... and hit F5 to build and start the app.

Again, we get the same output: "Results equal: true". The message in the build output tool window is gone. This time, the Accelerator was actually run. We can tell because the app displays the number of segment invocations: 1.

Verifying Accelerated Code

Two aspects are of utmost importance when it comes to code acceleration:

  1. The result must be the same.
  2. The result must be calculated faster.

Note: the first aspect really deserves its own article. In short: as with all parallel execution, results may not be exactly the same when computed on different paths, in a different order, or on different devices. This is especially true for floating point calculations. For high-level optimizations it is also true in a more general sense: depending on the fastest execution strategy and/or the fastest device, segments must be implemented (compiled to low-level device code) in very different ways! Instead of comparing results for exact (binary) identity, you would check that the absolute difference between the results lies within the range of the expected error, given the algorithm used. For floating point operations, this error commonly corresponds to the round-off error accumulated over the instructions involved. No one said it would be easy ...
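To make the idea of a tolerance-based comparison concrete, here is a minimal sketch in plain C#. It does not use ILNumerics; the class name ResultComparison, its methods, and the default tolerances are assumptions chosen for illustration, and suitable tolerances depend on the algorithm at hand.

```csharp
using System;

public static class ResultComparison
{
    // Returns true if every pair of elements differs by at most
    // absTol + relTol * |expected|. The default tolerances are illustrative only.
    public static bool AllClose(double[] expected, double[] actual,
                                double absTol = 1e-12, double relTol = 1e-9)
    {
        if (expected.Length != actual.Length) return false;
        for (int i = 0; i < expected.Length; i++)
        {
            double diff = Math.Abs(expected[i] - actual[i]);
            if (double.IsNaN(diff)) return false;
            if (diff > absTol + relTol * Math.Abs(expected[i])) return false;
        }
        return true;
    }

    // Sums the values front to back.
    public static double SumForward(double[] values)
    {
        double sum = 0;
        for (int i = 0; i < values.Length; i++) sum += values[i];
        return sum;
    }

    // Sums the same values back to front; round-off may accumulate differently.
    public static double SumBackward(double[] values)
    {
        double sum = 0;
        for (int i = values.Length - 1; i >= 0; i--) sum += values[i];
        return sum;
    }
}
```

Summing { 0.1, 0.2, 0.3 } with SumForward and SumBackward gives two values that are not bitwise identical, yet AllClose accepts them as equal within tolerance. This is the kind of check the note above recommends over exact comparison.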

In order to verify correctness and to measure speed-up, we compare the accelerated expression results with the result of running the original expression the traditional way (using ILNumerics.Computing). Here, the code comments inserted earlier come into play. The Accelerator works like a state machine: when searching a code file for suitable expressions, the compiler starts with the default value for its enabled state: true. Right before the 2nd expression, the comment //ILN(enabled = false) disables the Accelerator. It stays disabled until another code comment enables it again or the end of the file is reached.
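The state-machine behaviour described above can be illustrated with a toy scanner. This is not the Accelerator's implementation: the class IlnDirectiveScanner, the regular expression, and its whitespace handling are assumptions made for this sketch. It only mirrors the rule "start enabled, flip the state at each directive comment, keep it until the next directive or the end of the file".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class IlnDirectiveScanner
{
    // Matches comments such as //ILN(enabled = false).
    // The tolerated whitespace is an assumption of this sketch.
    private static readonly Regex Directive = new Regex(
        @"//\s*ILN\(\s*enabled\s*=\s*(true|false)\s*\)", RegexOptions.Compiled);

    // Returns, for each source line, the enabled state in effect once that
    // line has been processed.
    public static bool[] EnabledPerLine(IEnumerable<string> lines)
    {
        var states = new List<bool>();
        bool enabled = true; // default state at the start of every file
        foreach (string line in lines)
        {
            Match m = Directive.Match(line);
            if (m.Success) enabled = m.Groups[1].Value == "true";
            states.Add(enabled);
        }
        return states.ToArray();
    }
}
```

Feeding it a file with an expression, a disabling comment, a second expression, and an enabling comment marks only the second expression as excluded from acceleration.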

Now that we have ensured a correct result, let's focus on the second aspect: speed-up. Throughout this article we will use a simple scheme: the expression to be measured is calculated repeatedly. After 10 inner repetitions the average measured execution time is displayed, and the whole measurement is repeated 5 times. After applying this scheme to the original code we go on to measure the accelerated code:
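As a generic illustration of this measurement scheme (not the tutorial's own listing), the following plain .NET helper runs an arbitrary action in 5 outer rounds of 10 inner repetitions and reports the average time per repetition for each round. The names Benchmark and Measure are hypothetical.

```csharp
using System;
using System.Diagnostics;

public static class Benchmark
{
    // Runs 'work' in 'outer' rounds of 'inner' repetitions each and returns
    // the average time per repetition, in milliseconds, for every round.
    public static double[] Measure(Action work, int outer = 5, int inner = 10)
    {
        var averages = new double[outer];
        for (int o = 0; o < outer; o++)
        {
            var watch = Stopwatch.StartNew();
            for (int i = 0; i < inner; i++) work();
            watch.Stop();
            averages[o] = watch.Elapsed.TotalMilliseconds / inner;
        }
        return averages;
    }
}
```

Reporting each round separately, rather than one overall average, makes warm-up effects visible: early rounds are often slower than later ones.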

Compile & run this in Release mode and without a debugger attached (Ctrl+F5)! On our test computer this gave the following result:

So, while the first expression required 450 ms on average, the second expression (the one we have accelerated) required 0 ms on average! This sounds like quite some speed-up, doesn't it?!

But wait! Measuring performance has always been a challenge, even in the sequential world. Now that we live in a parallel world, things become even more interesting! In general, whenever you hear someone claiming a speed-up of more than an order of magnitude or two, be careful and doubtful! It likely deserves a closer look...

A Closer Look

ILNumerics Accelerator turns the marked expression from our example into a single segment. It removes all temporaries and constant parameters, makes an educated guess about the 'best' execution unit to process the segment's instructions on, compiles the segment for the selected device, and optimizes it for the concrete data and hardware properties. It also enqueues the segment to the device for asynchronous execution.

The main thread immediately continues preparing subsequent segments. So, in our example, the main thread processes and enqueues all 10 inner iterations before eventually arriving at the watch.ElapsedMilliseconds expression in line 30. At this time, however, the segments' work is not done yet! Segments were enqueued and are now in an undefined state: some may have finished processing already, some are still waiting, and some may currently be executing. We don't know!

ILNumerics manages scheduling of segments for you. One especially important synchronization point is when access to the values of an array is required. In our example the first access happens in line 34, when comparing the results. Here, execution is halted until the processing is finished. At this time, all 5 outer iterations have been visited already. 
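The effect of reading the stopwatch before the work has finished can be reproduced without ILNumerics. The sketch below uses a plain .NET Task as a stand-in for an enqueued segment; AsyncTimingDemo and its members are hypothetical names, and Thread.Sleep merely simulates a computation.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncTimingDemo
{
    // Starts simulated background work and reports the elapsed milliseconds
    // right after enqueueing and again after waiting for completion.
    public static (double AfterEnqueueMs, double AfterWaitMs, int Result) Run(int workMs)
    {
        var watch = Stopwatch.StartNew();
        Task<int> pending = Task.Run(() =>
        {
            Thread.Sleep(workMs); // stands in for the real computation
            return 42;
        });

        // The work was only enqueued; it is most likely still running.
        double afterEnqueue = watch.Elapsed.TotalMilliseconds;

        // Synchronization point: accessing the result blocks until it exists.
        int result = pending.Result;
        double afterWait = watch.Elapsed.TotalMilliseconds;

        return (afterEnqueue, afterWait, result);
    }
}
```

The first reading typically shows only the cost of enqueueing, while the second includes the work itself. A timing taken before the synchronization point therefore says little about how long the computation takes.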

In order to make the measurement fairer, we insert a manual synchronization instruction right before line 30, leaving the other code untouched:

The Finish() member function waits until all pending computations writing to the array are completed. Running this code again (in Release mode, without a debugger attached) now gave the following results:

The speed-up is a factor of ~15x on our machine. Be prepared to see very different results on different hardware! Note that any manual synchronization likely introduces waiting times during execution which are not there if synchronization is left to the Accelerator. So, in practice, measuring performance can slow down your algorithm!

As a side note: the actual results, i.e. the element values compared to the non-optimized version, were of course also correct before adding the call to A2.Finish()! We had just skipped the waiting in our measurement. Engineers in science and industry will have to reconsider how much value there is in measuring individual execution branches and individual array operations at too fine a level of granularity. From now on, such array operations execute in parallel; hence, we must focus on larger computational parts instead! In the end, only the time to the final result is important!

The measured speed-up can vary not only with your hardware. Let's modify the test code: this time we sum the elements along the rows instead of along the columns. To demonstrate the Accelerator's ability to handle constant parameters we introduce a scalar constant variable, d = 1, and pass it as a parameter to the sum function(s). Here, it is important to declare the variable as const int: axis definitions must be compile-time constants, or the Accelerator leaves the instruction without optimization. Alternatively, of course, providing a constant numeric literal ('dim: 1') works, too.

Running this snippet produced the following results:

This time the speed-up is ~40x. Note that the execution of the non-optimized part took more than twice as long! This is caused by the strong dependency of the execution speed of many common implementations on the storage order of the data processed. This dependency, however, does not exist for the optimized code. Its sum() implementation runs efficiently in any direction. Its execution time did not change significantly compared to the former example, where we summed along the columns.
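Why a straightforward implementation depends on the storage order can be seen from a naive sketch in plain C#. It assumes a column-major layout (element (r, c) stored at index r + c * rows); the class StorageOrderDemo and its methods are hypothetical and do not represent the ILNumerics implementation.

```csharp
using System;

public static class StorageOrderDemo
{
    // 'data' holds a rows x cols matrix in column-major order:
    // element (r, c) is stored at data[r + c * rows].

    // One sum per column: the inner loop walks memory with stride 1.
    public static double[] SumAlongColumns(double[] data, int rows, int cols)
    {
        var result = new double[cols];
        for (int c = 0; c < cols; c++)
            for (int r = 0; r < rows; r++)
                result[c] += data[r + c * rows];
        return result;
    }

    // One sum per row: the inner loop jumps 'rows' elements at every step,
    // which tends to be far less cache friendly for large matrices.
    public static double[] SumAlongRows(double[] data, int rows, int cols)
    {
        var result = new double[rows];
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                result[r] += data[r + c * rows];
        return result;
    }
}
```

Both methods perform the same number of additions; only the memory access pattern differs. An implementation whose speed should be independent of the summing direction has to reorganize its loops so that memory is still traversed contiguously.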

Note further that the Accelerator adjusts the segment to the actual runtime situation. Therefore, multiple invocations are often required to reach the optimal execution speed. According to the timings above, each of the first three invocations led to a significant improvement in execution speed. Afterwards, the optimal segment was found.

Positive Side Effects by Acceleration

Many high-level optimizations applied by ILNumerics Accelerator have a positive effect on the stability and compatibility of your algorithms, which can be seen from the next example: by increasing the problem size to [1,000,000 x 10,000] we hit the limits of the available memory on our test computer. As expected, the non-optimized code throws an OutOfMemoryException. The accelerated code, however, consumes much less memory. Hence, it completes without error:

Output:

While ILNumerics Accelerator uses much less memory, often none, for (elided) intermediate results, parallel execution may add other memory demands.
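The memory saving from eliding intermediate results can be illustrated with a hand-written comparison in plain C#. This is a conceptual sketch, not what the Accelerator generates: FusionDemo and its methods are hypothetical. The first method materializes the full n x m intermediate (the values 1..n in each of m columns) before summing; the second generates and accumulates the values in one pass.

```csharp
using System;

public static class FusionDemo
{
    // Materializes the full n x m intermediate (values 1..n in every column),
    // then sums each column. Needs O(n * m) memory for the temporary.
    public static double[] ColumnSumsMaterialized(int n, int m)
    {
        var temp = new double[checked(n * m)];
        for (int c = 0; c < m; c++)
            for (int r = 0; r < n; r++)
                temp[r + c * n] = r + 1;

        var sums = new double[m];
        for (int c = 0; c < m; c++)
            for (int r = 0; r < n; r++)
                sums[c] += temp[r + c * n];
        return sums;
    }

    // Fused version: generates and accumulates each value on the fly.
    // Needs only O(m) memory for the result.
    public static double[] ColumnSumsFused(int n, int m)
    {
        var sums = new double[m];
        for (int c = 0; c < m; c++)
        {
            double s = 0;
            for (int r = 0; r < n; r++) s += r + 1;
            sums[c] = s;
        }
        return sums;
    }
}
```

For n = 10,000 each column sum is 50,005,000 in both versions. Only the materialized variant has to allocate the temporary, which is what eventually fails once the problem no longer fits into memory.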

Arrays as Input Data

So far, the input to sum() has been a binary expression on generator functions (arange(), ones()). Such functions are considered complex, non-scalar constants. ILNumerics Accelerator is often able to optimize them away. We'll finish this article with another example, this time receiving actual n-dimensional array data as input. It applies multiple binary bit manipulation functions to the data and sums the result:

Again, execution times are measured for the first (non-optimized) expression and then for the optimized version. Running this snippet on our test machine in Release mode and without a debugger attached gave:

We see a speed-up of a good order of magnitude (~13x). Among other factors, it is supported by the small element data type (uint), which gives room for wider parallel SIMD vector execution via AVX (8 elements on our CPU). Further, and equally important, the automatic parallelization across multiple CPU cores again plays to its advantage here.
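To give an impression of what SIMD execution on uint data looks like, here is a minimal sketch using System.Numerics.Vector<uint> from the .NET base class library. It is not ILNumerics code; SimdBitDemo and the concrete expression (x & mask) ^ flip are assumptions chosen for illustration. On hardware with 256-bit vector registers, Vector<uint>.Count is 8, so each loop iteration handles 8 elements.

```csharp
using System;
using System.Numerics;

public static class SimdBitDemo
{
    // Scalar reference: sum of ((x & mask) ^ flip) over all elements,
    // wrapping around on overflow.
    public static uint SumScalar(uint[] data, uint mask, uint flip)
    {
        uint sum = 0;
        unchecked
        {
            foreach (uint x in data) sum += (x & mask) ^ flip;
        }
        return sum;
    }

    // Vectorized version: processes Vector<uint>.Count elements per step.
    public static uint SumSimd(uint[] data, uint mask, uint flip)
    {
        int width = Vector<uint>.Count; // number of lanes, hardware dependent
        var maskV = new Vector<uint>(mask);
        var flipV = new Vector<uint>(flip);
        var acc = Vector<uint>.Zero;

        int i = 0;
        for (; i <= data.Length - width; i += width)
        {
            var v = new Vector<uint>(data, i);
            acc += (v & maskV) ^ flipV; // lane-wise, wraps on overflow
        }

        uint sum = 0;
        unchecked
        {
            // Reduce the lanes, then handle the remaining tail elements.
            for (int lane = 0; lane < width; lane++) sum += acc[lane];
            for (; i < data.Length; i++) sum += (data[i] & mask) ^ flip;
        }
        return sum;
    }
}
```

Because unsigned addition wraps modulo 2^32 and is associative, both methods return identical results regardless of how the elements are grouped into lanes.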

Traditional attempts to parallelize workloads like this split the input data into chunks and distribute them onto multiple cores. The attempt to calculate multiple output chunks in parallel requires a certain minimal workload to hide latencies and other overhead of parallelization. Not so for ILNumerics Accelerator! We don't split the input data but instead distribute whole expression invocations over suitable devices - here: CPU cores.
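The difference between splitting one input into chunks and distributing whole invocations can be sketched with the Task Parallel Library. This is a rough analogy, not the Accelerator's scheduling mechanism; InvocationParallelismDemo and its methods are hypothetical. Each small input is processed from start to finish by a single thread, and parallelism comes from running many such invocations at the same time.

```csharp
using System;
using System.Threading.Tasks;

public static class InvocationParallelismDemo
{
    // One "expression invocation": a small workload that a single thread
    // processes from start to finish.
    public static long Evaluate(int[] input)
    {
        long sum = 0;
        foreach (int x in input) sum += (long)x * x;
        return sum;
    }

    // Baseline: all invocations run one after another on the calling thread.
    public static long[] RunSequential(int[][] inputs)
    {
        var results = new long[inputs.Length];
        for (int k = 0; k < inputs.Length; k++) results[k] = Evaluate(inputs[k]);
        return results;
    }

    // Whole invocations are distributed over the available cores;
    // no individual input array is split.
    public static long[] RunParallel(int[][] inputs)
    {
        var results = new long[inputs.Length];
        Parallel.For(0, inputs.Length, k => results[k] = Evaluate(inputs[k]));
        return results;
    }
}
```

With this scheme even inputs that are too small to be worth chunking can keep several cores busy, provided enough independent invocations are available.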

Profiling this snippet in VTune produced the following image (black marks added):

As can be clearly seen, without the Accelerator the many iterations over the small workloads were not parallelized at all! In contrast, the optimized code parallelizes nicely on 8 threads, leading to much better resource utilization.

But even this great speed-up leaves room for improvement. For example, more aggressive kernel optimization leads to a further decrease in execution times. Adding the following setting in line 30:

ILNumerics.Segment.Default.SpecializeFlags = ILNumerics.Core.Segments.SpecializeFlags.BSDsAll;

... increases the speed-up to a factor of ~27:

Related articles

Getting Started, Part II - accelerate kmeans

ILNumerics Accelerator Configuration

Introducing ILNumerics Accelerator