ILNumerics - Technical Computing

Modern High Performance Tools for Technical Computing and Visualization in Industry and Science

ILNumerics Accelerator: Getting Started, Part I

Faster Low Level Expressions

In this set of tutorials we will get you going with ILNumerics Accelerator. We will start with some simple examples and speed measurements. You will learn about the various aspects of the ILNumerics Accelerator compiler and how it makes your code faster at both a low and a high level of granularity.

Installation

No installation is required to work with ILNumerics Accelerator. It can be used on any licensed and activated ILNumerics developer seat. Note that ILNumerics Accelerator requires a dedicated module license to be added to your seat. You may use a regular trial license.

In the following we will use Visual Studio 2022 and C# 8.0. Other versions work just the same.

First Accelerator Example

1. Create a new console application:

Open Visual Studio and create a new project: ConsoleApp1. Keep the defaults suggested by Visual Studio (targeting the most recent .NET SDK version installed, AnyCPU Console App). Most other configurations will do, too.

2. Import ILNumerics.Computing:

Right-click on the ConsoleApp1 node in Solution Explorer. Select "Manage NuGet Packages" to open the NuGet package manager. Find and install the ILNumerics.Computing package into the project.

3. Test the simple app:

Open Program.cs from the "Hello World" app created as a stub. Replace the content of the file with the following:

Hit F5 to start the app. The output should read something similar to this:

4. A simple accelerator expression:

Replace the content of the file with the following, more interesting array expression:

yielding ...

Congratulations! You have just summed 10,000 columns, each containing the numbers from 1 to 10,000. Instead of displaying the result, we compute the array twice and compare both results. Note how the second expression is surrounded by two comments: //ILN(enabled = false) and //ILN(enabled = true). Their meaning will become obvious soon.

5. Bring in the Accelerator:

During the pre-release phase the ILNumerics Accelerator compiler must be added manually as a package dependency to your project. This allows you to switch back to your original code easily by simply removing the package from the project.

Use the NuGet package manager to locate and import the ILNumerics.Accelerator.Command package into your project. Make sure to include pre-release versions when searching for the package during the beta phase! Accept the license prompt if you agree to the conditions.

Hit F5 to build and start the app. The output is the same as before. What else did we expect ... !?

Look at the Build pane of the Visual Studio Output tool window. It displays the following message:

ILNumerics Accelerator: no optimization was performed because the project setting 
'ILNAcceleratorEnabled' was: false. By default, Release builds are optimized. Debug builds are not.

ILNumerics Accelerator replaces code sections of array expressions with optimized versions thereof. While producing the same results, the resulting code looks and executes very differently. Such code can be confusing to debug. Hence, by default the ILNumerics Accelerator compiler is enabled only for Release builds. This setting is configurable.

6. Switch to Release Configuration:

... and hit F5 to build and start the app.

Again, we get the same output: "Results equal: true". The warning message in the build output tool window is gone. This time, the Accelerator actually ran. We can tell because the app displays the number of segment invocations: 1.
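As a plausibility check for the value computed in step 4, the following plain C# sketch reproduces the column sums without ILNumerics. It assumes the expression describes a 10,000 x 10,000 matrix whose columns each hold the numbers 1 to 10,000. ColumnSumDemo and ColumnSums are illustrative names, not part of any ILNumerics API. Every column must sum to n(n+1)/2, which is 50,005,000 for n = 10,000.

```csharp
using System;

public static class ColumnSumDemo
{
    // Sums each column of a (rows x cols) matrix whose columns all hold 1, 2, ..., rows.
    // The matrix is never stored; each element is generated on the fly.
    public static double[] ColumnSums(int rows, int cols)
    {
        var sums = new double[cols];
        for (int c = 0; c < cols; c++)
        {
            double s = 0;
            for (int r = 1; r <= rows; r++) s += r;
            sums[c] = s;
        }
        return sums;
    }

    public static void Main()
    {
        const int n = 10_000;
        double[] sums = ColumnSums(n, n);

        // Closed form for 1 + 2 + ... + n.
        double expected = n * (n + 1.0) / 2.0;
        bool equal = Array.TrueForAll(sums, s => s == expected);

        Console.WriteLine($"Expected column sum: {expected}");
        Console.WriteLine($"Results equal: {equal}");
    }
}
```

All partial sums are integers far below 2^53, so the comparison with == is exact here. This is a property of this particular data, not of floating point sums in general.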

Verifying Accelerated Code

Two aspects are of utmost importance when it comes to code acceleration:

  1. The result must be the same.
  2. The result must be calculated faster.

Note: the first aspect really deserves its own article. In short: as with all parallel execution, results may not be exactly the same when computed on different paths, in a different order, or on different devices. This is especially true for floating point calculations. But for high-level optimizations it is also true in a more general sense: depending on the fastest execution strategy and/or the fastest device, segments must be implemented (compiled to low-level device code) in very different ways! Instead of comparing results for exact (binary) identity, you would check that the absolute difference between the results lies within the range of the expected error, given the algorithm used. For floating point operations, this error commonly corresponds to the round-off error accumulated over the instructions involved. No one said it would be easy ...
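A tolerance-based comparison of this kind could look like the following minimal sketch in plain C#. ToleranceDemo and AllClose are hypothetical helpers, not ILNumerics functions, and the default tolerance of 1e-12 is an arbitrary assumption that would have to be derived from the algorithm at hand.

```csharp
using System;

public static class ToleranceDemo
{
    // True if a and b have equal length and every pair of elements differs by
    // at most absTol + relTol * max(|a[i]|, |b[i]|). NaN never compares as close.
    public static bool AllClose(double[] a, double[] b,
                                double relTol = 1e-12, double absTol = 0.0)
    {
        if (a.Length != b.Length) return false;
        for (int i = 0; i < a.Length; i++)
        {
            double scale = Math.Max(Math.Abs(a[i]), Math.Abs(b[i]));
            if (!(Math.Abs(a[i] - b[i]) <= absTol + relTol * scale)) return false;
        }
        return true;
    }

    public static void Main()
    {
        // The same three numbers, summed in two different orders.
        double forward = (0.1 + 0.2) + 0.3;
        double backward = 0.1 + (0.2 + 0.3);

        Console.WriteLine($"Bitwise equal:    {forward == backward}");
        Console.WriteLine($"Within tolerance: {AllClose(new[] { forward }, new[] { backward })}");
    }
}
```

The two sums differ in the last bit, which is exactly the kind of difference a changed execution order can introduce.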

In order to verify correctness and to measure speed-up we compare the results of accelerated expressions with the result of running the original expression the traditional way (using ILNumerics.Computing). Here, the code comments inserted earlier come into play. The Accelerator works like a state machine: when searching a code file for suitable expressions, the compiler starts with the default value for its enabled state: true. Right before the second expression the comment //ILN(enabled = false) disables the Accelerator. It stays disabled until another code comment re-enables it or the end of the file is reached.
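The state-machine behaviour of the two comments can be illustrated with a small stand-alone scanner. This is only a sketch of the idea: the real compiler certainly does not work line by line like this, and whether it tolerates the whitespace variations accepted by the regular expression below is an assumption. DirectiveScanDemo and EnabledPerLine are made-up names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DirectiveScanDemo
{
    // Matches comments such as "//ILN(enabled = false)"; extra whitespace is tolerated.
    private static readonly Regex Directive =
        new Regex(@"//\s*ILN\(\s*enabled\s*=\s*(true|false)\s*\)", RegexOptions.Compiled);

    // Returns, for every source line, the enabled state in effect on that line.
    public static List<bool> EnabledPerLine(IEnumerable<string> lines)
    {
        var result = new List<bool>();
        bool enabled = true;                      // default state at the start of a file
        foreach (string line in lines)
        {
            Match m = Directive.Match(line);
            if (m.Success) enabled = m.Groups[1].Value == "true";
            result.Add(enabled);
        }
        return result;
    }

    public static void Main()
    {
        string[] source =
        {
            "A = <array expression>;",
            "//ILN(enabled = false)",
            "A2 = <array expression>;",
            "//ILN(enabled = true)",
            "B = <array expression>;",
        };

        List<bool> states = EnabledPerLine(source);
        for (int i = 0; i < source.Length; i++)
            Console.WriteLine($"{(states[i] ? "on " : "off")}  {source[i]}");
    }
}
```

In this toy input only the line assigning A2 sits in the disabled region, mirroring the setup of the example.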

Thus, we have configured our code in such a way that A is calculated with acceleration, while A2 is calculated the "traditional" way, i.e. without acceleration and using the well-known, established functions from the ILNumerics Computing Engine (ILMath, etc.). We know that A2 is correct. So we can simply compare the results to check the correctness of the accelerated version A.

Now that we have ensured a correct result, let's focus on the second aspect: speed-up. Throughout this article we will use a simple scheme: the expression to be measured is calculated 10 times in an inner loop, the average execution time of these repetitions is displayed, and the whole measurement is repeated 5 times. After applying this scheme to the original code we go on to measure the accelerated code:

Compile and run this in Release mode and without a debugger attached (Ctrl+F5)! On our test computer this gave the following result:

So, while the first expression required 185 ms on average, the same expression, once accelerated, required 0 ms on average! This sounds like quite some speed-up, doesn't it?!
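The measurement scheme itself (10 inner repetitions per average, 5 outer repetitions) does not depend on ILNumerics. A minimal sketch using System.Diagnostics.Stopwatch follows. TimingDemo, Measure and the placeholder workload are assumptions made for illustration. Unlike the integer-valued watch.ElapsedMilliseconds, the sketch reads Elapsed.TotalMilliseconds, so sub-millisecond averages do not collapse to 0.

```csharp
using System;
using System.Diagnostics;

public static class TimingDemo
{
    // Runs `work` `inner` times per measurement and returns one average
    // (in milliseconds) for each of the `outer` repetitions.
    public static double[] Measure(Action work, int outer = 5, int inner = 10)
    {
        var averages = new double[outer];
        for (int o = 0; o < outer; o++)
        {
            var watch = Stopwatch.StartNew();
            for (int i = 0; i < inner; i++) work();
            watch.Stop();
            averages[o] = watch.Elapsed.TotalMilliseconds / inner;
        }
        return averages;
    }

    public static void Main()
    {
        double sink = 0;
        // Placeholder workload; substitute the expression to be measured.
        Action work = () => { for (int k = 1; k <= 1_000_000; k++) sink += Math.Sqrt(k); };

        foreach (double ms in Measure(work))
            Console.WriteLine($"average: {ms:F3} ms");

        // Using the accumulated value keeps the workload from being optimized away.
        Console.WriteLine($"checksum positive: {sink > 0}");
    }
}
```

This scheme only measures what the calling thread waits for. Work that is merely started inside the timed region, but completes later, is not included.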

But wait! Measuring performance has always been a challenge, even in the sequential world. Now that we live in a parallel world, things become even more interesting! In general, whenever you hear someone claim a speed-up of more than an order of magnitude or two: be careful and doubtful! It likely deserves a closer look.

A Closer Look

ILNumerics Accelerator turns the marked expression from our example into a single segment. It removes all temporaries and constant parameters, makes an educated guess about the 'best' execution unit to process the segment's instructions on, compiles the segment for the selected device, and optimizes it for the concrete data and hardware properties. On top of that, it enqueues the segment to the device for asynchronous execution.

Without spending cycles waiting for the result, the main thread immediately continues preparing subsequent segments. So, in our example, the main thread processes and enqueues all 10 inner iterations before eventually arriving at the watch.ElapsedMilliseconds expression in line 30. At this time, however, the segments' work is not done yet! Segments were enqueued and are now in an undefined state: some may have finished processing already. Some are still waiting. Some may currently be executing. We don't know!

ILNumerics manages scheduling of segments for you. One especially important synchronization point is when access to the values of an array is required. In our example the first access happens in line 34, when comparing the results. Here, execution is halted until the processing is finished. At this time, all 5 outer iterations have been visited already. 

To measure a specific region of code, we insert a manual synchronization instruction right before line 30, leaving the other code untouched:

The Finish() member function waits until all pending computations writing to the array are completed. Running this code again (in Release mode, without a debugger attached) now gave the following results:

While the times for the non-optimized version remained the same, timings for the optimized version are now more reasonable. Note how the optimizations are applied in multiple stages: when an optimized instruction is executed for the first time, only some optimizations are applied and the instruction runs with reasonable efficiency. At the same time, a highly optimized kernel is built in the background. Once the kernel is finished, it is used for subsequent runs to give even more speed-up. Here, kernel building took 3 iterations of the optimized version. Initially, the optimized version was ~2x faster than the original version. The optimized kernel brought us another ~7x speed-up.
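The staged behaviour described here (run a moderately optimized version immediately, switch to a better kernel once it has been built in the background) follows a general pattern that can be sketched in plain C#. TieredKernel is a hypothetical class written for this illustration. It says nothing about how the Accelerator implements its stages internally.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public sealed class TieredKernel
{
    private readonly Func<double[], double> _baseline;
    private readonly Task<Func<double[], double>> _build;

    // Starts building the optimized kernel in the background right away.
    public TieredKernel(Func<double[], double> baseline,
                        Func<Func<double[], double>> buildOptimized)
    {
        _baseline = baseline;
        _build = Task.Run(buildOptimized);
    }

    // True once the background build has produced the optimized kernel.
    public bool IsOptimized => _build.Status == TaskStatus.RanToCompletion;

    // Uses the optimized kernel if it is ready and the baseline otherwise. Never blocks.
    public double Invoke(double[] data)
        => IsOptimized ? _build.Result.Invoke(data) : _baseline(data);
}

public static class TieredKernelDemo
{
    public static void Main()
    {
        var kernel = new TieredKernel(
            baseline: d => d.Sum(),
            buildOptimized: () =>
            {
                Thread.Sleep(50);   // stands in for the time spent compiling a kernel
                return d => { double s = 0; foreach (double x in d) s += x; return s; };
            });

        double[] data = Enumerable.Range(1, 100).Select(i => (double)i).ToArray();
        for (int run = 0; run < 5; run++)
        {
            Console.WriteLine($"run {run}: optimized={kernel.IsOptimized}, result={kernel.Invoke(data)}");
            Thread.Sleep(25);
        }
    }
}
```

The first runs use the baseline and later runs pick up the optimized delegate, while the result stays the same throughout.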

The final speed-up is a factor of ~16x on our machine. Be prepared to see very different results on other hardware! Note that any manual synchronization likely introduces waiting times during execution which are not there if synchronization is left to the Accelerator. So, in practice, measuring performance can slow down your algorithm!
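The effect of timing asynchronous work with and without a synchronization point can be reproduced with ordinary .NET tasks. The sketch below is only an analogy: Task.Run stands in for enqueuing a segment and Task.WaitAll plays the role of a call such as Finish(). AsyncTimingDemo, Enqueue and Measure are hypothetical names.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncTimingDemo
{
    // Stand-in for an enqueued segment: returns immediately, completes later.
    public static Task Enqueue(int workMs) => Task.Run(() => Thread.Sleep(workMs));

    // Enqueues `count` work items and returns the elapsed milliseconds,
    // optionally waiting for all of them before stopping the clock.
    public static double Measure(int count, int workMs, bool waitForCompletion)
    {
        var pending = new Task[count];
        var watch = Stopwatch.StartNew();
        for (int i = 0; i < count; i++) pending[i] = Enqueue(workMs);
        if (waitForCompletion) Task.WaitAll(pending);   // explicit synchronization point
        watch.Stop();
        double elapsed = watch.Elapsed.TotalMilliseconds;

        Task.WaitAll(pending);                          // let all work finish before returning
        return elapsed;
    }

    public static void Main()
    {
        Console.WriteLine($"without synchronization: {Measure(10, 20, false):F1} ms");
        Console.WriteLine($"with synchronization:    {Measure(10, 20, true):F1} ms");
    }
}
```

Without the wait, the stopwatch only captures the cost of enqueuing, which is why such a measurement can report values close to zero.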

ILNumerics Accelerator applies a multitude of optimizations to achieve this speed-up: it merges all three instructions (sum, arange, ones) into a single segment, removes intermediate results, builds low-level kernels, efficiently utilizing SIMD vector instructions, and executes many invocations of such kernels in parallel. 

Further, and as a side note: the actual results, i.e. the element values compared to the non-optimized version, were of course correct even before adding the call to A2.Finish()! We had just skipped the waiting for measurement purposes. Engineers in science and industry will have to reconsider how much value there is in measuring individual execution branches and individual array operations at too low a level of granularity. From now on, such array operations execute in parallel; hence, we must focus on larger computational parts instead. In the end, only the time to the final result counts.

Robust Cache Awareness

The measured speed-up varies not only with your hardware. Let's modify the test code: this time we sum the elements along the rows instead of along the columns. To demonstrate the Accelerator's ability to handle constant parameters we introduce a scalar constant variable, d = 1, and pass it as a parameter to the sum function(s). Here, it is important to declare the variable as const int: axis definitions must be compile-time constants, or the Accelerator would leave the instruction without optimization. Alternatively, of course, providing a constant numeric literal ('dim: 1') works, too.

Running this snippet produced the following results:

This time the speed-up is ~60x. Note that the execution of the non-optimized part took more than three times as long! This is caused by the strong dependency of the execution speed of many common implementations on the storage order of the data processed. This detrimental effect, however, has been optimized away by the Accelerator. Its sum() implementation runs efficiently in any direction. Thus, the measured execution times did not change significantly compared to the former example, where we had summed along the columns.
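Why the storage order matters for a naive implementation can be seen in the following plain C# sketch. It assumes a column-major layout (element (r, c) stored at index r + c * rows) and uses hand-written loops. None of the names belong to ILNumerics, and the loops are not meant to mirror either the traditional or the accelerated sum() implementation.

```csharp
using System;
using System.Diagnostics;

public static class StorageOrderDemo
{
    // One result per column; reads memory contiguously.
    public static double[] SumAlongColumns(double[] data, int rows, int cols)
    {
        var result = new double[cols];
        for (int c = 0; c < cols; c++)
            for (int r = 0; r < rows; r++)
                result[c] += data[r + c * rows];
        return result;
    }

    // One result per row, naive loop order: consecutive reads are `rows` elements apart.
    public static double[] SumAlongRowsStrided(double[] data, int rows, int cols)
    {
        var result = new double[rows];
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                result[r] += data[r + c * rows];
        return result;
    }

    // Same result, but memory is walked contiguously and all row sums are updated in turn.
    public static double[] SumAlongRowsContiguous(double[] data, int rows, int cols)
    {
        var result = new double[rows];
        for (int c = 0; c < cols; c++)
            for (int r = 0; r < rows; r++)
                result[r] += data[r + c * rows];
        return result;
    }

    public static void Main()
    {
        const int n = 4_000;                    // 4,000 x 4,000 doubles = 128 MB
        var data = new double[n * n];
        for (int i = 0; i < data.Length; i++) data[i] = i % 7;

        var watch = Stopwatch.StartNew();
        SumAlongColumns(data, n, n);
        Console.WriteLine($"along columns:           {watch.ElapsedMilliseconds} ms");

        watch.Restart();
        SumAlongRowsStrided(data, n, n);
        Console.WriteLine($"along rows (strided):    {watch.ElapsedMilliseconds} ms");

        watch.Restart();
        SumAlongRowsContiguous(data, n, n);
        Console.WriteLine($"along rows (contiguous): {watch.ElapsedMilliseconds} ms");
    }
}
```

On typical cache-based hardware the strided variant tends to be noticeably slower than the other two, although the actual ratio depends on the machine. Reordering the loops removes the penalty without changing the result.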

Note further that the Accelerator adjusts the segment to the actual runtime situation. Therefore, multiple invocations are often required to reach the optimal execution speed. In the timings above, each of the first three invocations brought a significant improvement in execution speed. Afterwards, the optimal execution strategy had been found.

Positive Side Effects by Acceleration

Many high-level optimizations applied by ILNumerics Accelerator have a positive effect on the stability and compatibility of your algorithms, as the next example shows: by increasing the problem size to [1,000,000 x 100,000] we hit the limits of the available memory on our test computer. As expected, the non-accelerated expression throws an OutOfMemoryException. The accelerated code, however, consumes much less memory and never attempts to allocate huge intermediate results. It completes without error:

Output:

ILNumerics Accelerator often uses much less or even no memory for intermediate results, because they are elided. However, because many more array instructions execute in parallel, other memory demands arise. Overall, an accelerated program uses more CPU and may in some situations also use more memory.
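The memory argument can be made concrete with a little arithmetic and a toy comparison. The sketch assumes 8-byte double elements (the element type of the example is not stated here) and uses hand-written loops with made-up names. Under that assumption a dense [1,000,000 x 100,000] intermediate would need 800 GB, whereas a fused evaluation only stores the result.

```csharp
using System;

public static class IntermediateMemoryDemo
{
    // Bytes needed to store a dense (rows x cols) array.
    public static long DenseBytes(long rows, long cols, int bytesPerElement)
        => checked(rows * cols * bytesPerElement);

    // Traditional evaluation: materialize the (rows x cols) operand, then reduce it.
    public static double[] SumMaterialized(int rows, int cols)
    {
        var m = new double[checked(rows * cols)];   // column-major intermediate
        for (int c = 0; c < cols; c++)
            for (int r = 0; r < rows; r++)
                m[r + c * rows] = r + 1;

        var sums = new double[cols];
        for (int c = 0; c < cols; c++)
            for (int r = 0; r < rows; r++)
                sums[c] += m[r + c * rows];
        return sums;
    }

    // Fused evaluation: elements are generated on the fly; only the result is stored.
    public static double[] SumFused(int rows, int cols)
    {
        var sums = new double[cols];
        for (int c = 0; c < cols; c++)
            for (int r = 0; r < rows; r++)
                sums[c] += r + 1;
        return sums;
    }

    public static void Main()
    {
        long bytes = DenseBytes(1_000_000, 100_000, sizeof(double));
        Console.WriteLine($"Dense intermediate: {bytes / 1e9} GB");

        // Both strategies agree at a size that still fits into memory.
        double[] a = SumMaterialized(1_000, 500);
        double[] b = SumFused(1_000, 500);
        Console.WriteLine($"Results equal: {a.AsSpan().SequenceEqual(b)}");
    }
}
```

The fused variant needs memory proportional to the number of columns only, so it keeps working at sizes where the materialized variant cannot even allocate its operand.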

Arrays as Input Data

So far, the input to sum() has been a binary expression on generator functions (arange(), ones()). Such functions are considered complex, non-scalar constants. ILNumerics Accelerator is often able to optimize them away. We'll finish this article with another example, this time receiving actual n-dimensional array data as input. It applies multiple binary bit manipulation functions to the data and sums the result:

Again, execution times are measured first for the non-optimized expression and then for the optimized version. Running this snippet on our test machine in Release mode and without a debugger attached yields:

We see a speed-up of ~10x. Among other things, it is supported by the small element data type (uint), which leaves room for wider parallel SIMD vector execution via AVX (8 elements on our CPU). Further, and equally important, automatic parallelization across multiple CPU cores again plays to its strengths here.
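How a small element type translates into SIMD lanes can be tried out with System.Numerics.Vector<T> from the .NET base library. The sketch below is not the code the Accelerator generates. The bit operations (an AND followed by an XOR) are placeholders chosen for illustration, and SimdBitsDemo is a made-up name. On hardware with 256-bit vectors, Vector<uint>.Count is 8.

```csharp
using System;
using System.Numerics;

public static class SimdBitsDemo
{
    // Scalar reference: sum of ((x & mask) ^ flip) over all elements, wrapping on overflow.
    public static uint SumScalar(uint[] data, uint mask, uint flip)
    {
        uint sum = 0;
        unchecked { foreach (uint x in data) sum += (x & mask) ^ flip; }
        return sum;
    }

    // Same computation, processing Vector<uint>.Count elements per step.
    public static uint SumVector(uint[] data, uint mask, uint flip)
    {
        int width = Vector<uint>.Count;
        var vMask = new Vector<uint>(mask);
        var vFlip = new Vector<uint>(flip);
        var acc = Vector<uint>.Zero;

        int i = 0;
        for (; i <= data.Length - width; i += width)
        {
            var v = new Vector<uint>(data, i);
            acc += (v & vMask) ^ vFlip;           // lane-wise, wraps like unchecked uint
        }

        uint sum = 0;
        unchecked
        {
            for (int lane = 0; lane < width; lane++) sum += acc[lane];
            for (; i < data.Length; i++) sum += (data[i] & mask) ^ flip;   // remainder
        }
        return sum;
    }

    public static void Main()
    {
        var rng = new Random(42);
        var data = new uint[1_000_003];           // deliberately not a multiple of the vector width
        for (int i = 0; i < data.Length; i++) data[i] = (uint)rng.Next();

        Console.WriteLine($"lanes per vector: {Vector<uint>.Count}, hardware accelerated: {Vector.IsHardwareAccelerated}");
        bool equal = SumScalar(data, 0x00FF00FFu, 0x0F0F0F0Fu) == SumVector(data, 0x00FF00FFu, 0x0F0F0F0Fu);
        Console.WriteLine($"results equal: {equal}");
    }
}
```

Because unsigned addition wraps modulo 2^32, the order of summation does not affect the result here, so the scalar and the vectorized sums match exactly.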

Traditional attempts to parallelize workloads like this split the input data into chunks and distribute them onto multiple cores. Calculating multiple output chunks in parallel requires a certain minimal workload to justify the additional overhead of parallelization. Not so for ILNumerics Accelerator! We don't split the input data but instead distribute whole expression invocations across suitable devices (here: CPU cores).
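The difference between the two strategies can be sketched with the Task Parallel Library. This is an analogy under simplifying assumptions: the invocations are independent of each other, Evaluate is a placeholder reduction, and Parallel.For stands in for whatever scheduling the Accelerator performs. All names are hypothetical.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

public static class ParallelGranularityDemo
{
    // The "expression": a reduction over one small input array.
    public static double Evaluate(double[] input)
    {
        double s = 0;
        foreach (double x in input) s += x * x;
        return s;
    }

    // Strategy 1: handle invocations one after another and split each input into chunks.
    public static double[] SplitEachInput(double[][] inputs, int chunks)
    {
        var results = new double[inputs.Length];
        for (int k = 0; k < inputs.Length; k++)
        {
            double[] input = inputs[k];
            var partial = new double[chunks];
            Parallel.For(0, chunks, c =>
            {
                int start = (int)((long)input.Length * c / chunks);
                int end = (int)((long)input.Length * (c + 1) / chunks);
                double s = 0;
                for (int i = start; i < end; i++) s += input[i] * input[i];
                partial[c] = s;
            });
            results[k] = partial.Sum();
        }
        return results;
    }

    // Strategy 2: keep each input whole and distribute the invocations across cores.
    public static double[] DistributeInvocations(double[][] inputs)
    {
        var results = new double[inputs.Length];
        Parallel.For(0, inputs.Length, k => { results[k] = Evaluate(inputs[k]); });
        return results;
    }

    public static void Main()
    {
        // Many small workloads: 1,000 arrays of 2,000 small integer values each.
        double[][] inputs = Enumerable.Range(0, 1_000)
            .Select(k => Enumerable.Range(0, 2_000).Select(i => (double)((i + k) % 10)).ToArray())
            .ToArray();

        var watch = Stopwatch.StartNew();
        double[] a = SplitEachInput(inputs, Environment.ProcessorCount);
        Console.WriteLine($"split each input:       {watch.Elapsed.TotalMilliseconds:F1} ms");

        watch.Restart();
        double[] b = DistributeInvocations(inputs);
        Console.WriteLine($"distribute invocations: {watch.Elapsed.TotalMilliseconds:F1} ms");

        Console.WriteLine($"results equal: {a.SequenceEqual(b)}");
    }
}
```

With workloads this small, the first strategy pays the parallel dispatch overhead once per input, while the second pays it once for the whole batch. The inputs hold small integers, so both strategies produce exactly the same sums despite the different summation order.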

Profiling the array-input example from this section in Intel's VTune® produced the following image (black & orange marks added):

As can be clearly seen, without the Accelerator the many iterations over the small workloads were not parallelized at all! The workload is too small for that approach to pay off. In contrast, the optimized code parallelizes efficiently on 20 threads, leading to much better resource utilization.

Further, next to threading efficiency, we can clearly identify the two stages of JIT compilation: the accelerated code starts in stage 1 with moderate optimizations. This leads to a speed-up of ~5x. Once JIT compilation of the low-level kernels has finished (see the partly hidden thread at the very bottom of the image above), execution continues even faster, eventually gaining a full order of magnitude.

Conclusion

The set of optimizations performed by ILNumerics Accelerator brings true efficiency to array code execution. It not only distributes large and small workloads to parallel hardware; it does so automatically, without requiring you, the programmer, to think about execution strategies or available hardware capabilities.

Even this speed-up leaves room for improvement. We will add more aggressive kernel optimizations and even more efficient memory management to the Accelerator, to name just a few. In the next articles we dive into more details of the new optimizations introduced by ILNumerics Accelerator. We will start with array pipelining.

Related articles:

Getting Started, Part II - accelerate kmeans

ILNumerics Accelerator Configuration

Introducing ILNumerics Accelerator (faster-array-codes.html)