ILNumerics Accelerator: Getting Started, Part I
ILNumerics Accelerator is an optimizing JIT compiler for array codes. It reformulates array expressions with the ultimate goal of fastest execution speed. The compiler is engineered to make efficient use of diverse, heterogeneous compute hardware while prioritizing practical concerns such as developer convenience and scalability for large projects and teams.
Throughout the following guides we present various advantages of the new array compiler. This first article focuses on optimizations performed on individual, low-level array expressions. Part II investigates optimizations the compiler applies to a larger function context (kmeans). In part III ILNumerics Accelerator uses Intel's MKL to compute a large number of FFTs faster than ... Intel's MKL (sic) – thanks to better parallelization. And part IV examines the advantages of our new array instruction parallelism over traditional methods.
Faster Low Level Expressions
In this set of tutorials we start with some simple examples and speed measurements. You will learn about the various aspects of the ILNumerics Accelerator compiler and how it makes your code faster at both low and high levels of granularity.
ILNumerics Accelerator, by default, transforms every array expression in your code. The resulting code would be hard to debug, hence the compiler is enabled in Release mode only. To better understand the impact of the JIT compiler we will make use of various configuration options. By including or excluding code regions for acceleration we can measure and compare the execution of selected parts of the code with and without the Accelerator.
Installation
No installation is required to work with ILNumerics Accelerator. It can be used on any licensed and activated ILNumerics developer seat. The Accelerator comes as part of ILNumerics Computing Engine (the ILNumerics.Computing NuGet package). If you haven't done so yet, download and activate a trial license here.
In the following we will use Visual Studio 2022 and C# with .NET 8.0. Other versions work just the same.
First Accelerator Example
1. Create a new console application:
Open Visual Studio and create a new C# project: ConsoleApp1. Leave the defaults suggested by Visual Studio (target the most recent .NET version SDK installed, AnyCPU Console App - but most other configurations will do, too).
2. Import ILNumerics.Computing:
Right-click on the ConsoleApp1 node in Solution Explorer. Select "Manage NuGet Packages" to open the NuGet package explorer. Find and install the ILNumerics.Computing package into the project. Make sure to include pre-release packages in your search during the beta phase.
3. Write a simple app:
Open Program.cs from the "Hello World" stub app. Replace the content of the file with the following array expressions. They carry no real meaning here; any other array expressions would work just as well:
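The original code listing is not reproduced here; the following is an illustrative sketch of the kind of program the steps describe. The identifiers `Array<T>`, `arange`, `ones`, `allall` and the exact expression shape follow ILNumerics naming conventions but are assumptions – check them against the documentation of your installed ILNumerics.Computing version.

```csharp
// Illustrative sketch only – not the article's exact listing.
using System;
using ILNumerics;
using static ILNumerics.ILMath;

class Program
{
    static void Main()
    {
        //ILN(enabled = false)
        // baseline: executed the traditional way, without acceleration
        Array<int> A1 = sum(arange(1, 10_000) + ones<int>(10_000, 10_000), 0);
        //ILN(enabled = true)
        // the same expression, transformed by ILNumerics Accelerator
        Array<int> A2 = sum(arange(1, 10_000) + ones<int>(10_000, 10_000), 0);

        Console.WriteLine($"Results equal: {(bool)allall(A1 == A2)}");
    }
}
```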
Hit F5 to start the app. The output should read:
Great! We have just summed 10,000 columns, each containing the numbers from 1 to 10,000. Instead of displaying the result, we have computed this array twice and compared the results.
4. Execute in Release mode:
So far we ran the project in Debug mode. Now, let's switch to Release mode and start the app without debugging (Visual Studio: CTRL-F5)!
We get the same output: "Results equal: true". This time, the accelerator was actually run. We can tell, because the number of accelerated segment executions is greater than 0.
Note how the first expression is surrounded by two comments: //ILN(enabled = false) and //ILN(enabled = true). They disable the accelerator for this code region only. The array expressions inside are executed without acceleration. (The CLR JIT still performs its own optimizations, though. Here, the execution speed roughly corresponds to ILNumerics.Computing version 6.) Learn more about your options to control the Accelerator.
Verifying Accelerated Code
Two aspects are of utmost importance when it comes to code acceleration:
- The result must be 'equally correct'.
- The result should be calculated faster.
In order to verify correctness and to measure speed-up we compare accelerated expression results with the result of running the original expression the traditional way (using ILNumerics.Computing). Here, the code comments inserted earlier come into play. We can use them not only to ensure a correct result but also to measure the second aspect: speed-up.
The following modifications measure our expression repeatedly: after 10 inner repetitions the average measured execution time is displayed, and the whole measurement is repeated 5 times. After applying this scheme to the original code we go on to measure the accelerated code:
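The measurement scheme described above might look like the following sketch. The timing logic uses the standard .NET Stopwatch; the expression body and the ILNumerics identifiers are placeholders, as in the earlier sketch, and may differ from the article's original listing.

```csharp
using System;
using System.Diagnostics;
using ILNumerics;
using static ILNumerics.ILMath;

// ... inside Main():
var watch = new Stopwatch();

//ILN(enabled = false)
for (int outer = 0; outer < 5; outer++)
{
    watch.Restart();
    for (int inner = 0; inner < 10; inner++)
    {
        // the original (non-accelerated) expression under test
        Array<int> A1 = sum(arange(1, 10_000) + ones<int>(10_000, 10_000), 0);
    }
    Console.WriteLine($"original:    {watch.ElapsedMilliseconds / 10.0} ms");
}
//ILN(enabled = true)
for (int outer = 0; outer < 5; outer++)
{
    watch.Restart();
    for (int inner = 0; inner < 10; inner++)
    {
        // the same expression, accelerated
        Array<int> A2 = sum(arange(1, 10_000) + ones<int>(10_000, 10_000), 0);
    }
    Console.WriteLine($"accelerated: {watch.ElapsedMilliseconds / 10.0} ms");
}
```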
Compile and run this in Release mode without a debugger attached (CTRL + F5)! On our test computer this gave the following result:
So, while the first expression required 185 ms on average, the same – but accelerated – expression required 0 ms on average! This sounds like quite some speed-up, doesn't it?!
But wait! Measuring performance has always been a challenge – even in the sequential world. Now that we live in a parallel world, things get even more interesting! In general, whenever you hear someone claim a speed-up factor of more than a magnitude or two – be careful and doubtful! It likely deserves a closer look.
A Closer Look
ILNumerics Accelerator turns the marked expression from our example into a single segment. It not only removes all temporaries and constant parameters, makes an educated guess about the 'best' execution unit to process the segment's instructions on, compiles the segment for the selected device, and optimizes it for the concrete data and hardware properties – it also enqueues the segment to the device for asynchronous execution.
Without spending cycles waiting for the result, the main thread immediately continues preparing subsequent segments. So, in our example, the main thread processes and enqueues all 10 inner iterations before eventually arriving at the watch.ElapsedMilliseconds expression in line 30. At this time, however, the segments' work is not done yet! Segments were enqueued and are now in an undefined state: some may have finished processing already, some are still waiting, some may currently be executing. We don't know!
ILNumerics manages scheduling of segments for you. One especially important synchronization point is when access to the values of an array is required. In our example the first access happens in line 34, when comparing the results. Here, execution is halted until the processing is finished. At this time, all 5 outer iterations have been visited already.
In order to measure a certain region of code we insert a manual synchronization instruction right before line 30, leaving the other code untouched:
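A sketch of the modified accelerated measurement loop, continuing the placeholder code from the earlier sketches (the expression and identifiers remain assumptions):

```csharp
//ILN(enabled = true)
for (int outer = 0; outer < 5; outer++)
{
    watch.Restart();
    Array<int> A2 = default;
    for (int inner = 0; inner < 10; inner++)
    {
        A2 = sum(arange(1, 10_000) + ones<int>(10_000, 10_000), 0);
    }
    // manual synchronization point, inserted for measurement purposes only:
    // blocks until all pending computations writing to A2 have completed
    A2.Finish();
    Console.WriteLine($"accelerated: {watch.ElapsedMilliseconds / 10.0} ms");
}
```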
The Finish() member function synchronously waits until all pending computations writing to the array are completed. Running this code again (in Release mode, without a debugger attached) now gave the following results:
While the times for the non-optimized version remained the same, the timings for the optimized version are now more reasonable. Note how the optimizations are applied in multiple stages: when an optimized instruction is executed for the first time, only some optimizations are applied and the instruction runs with reasonable efficiency. At the same time, a highly optimized kernel is created in the background. Once that kernel is finished, it is used instead for subsequent runs to give even more speed-up. Here, kernel building took 3 iterations of the optimized version. Initially, the optimized version was ~3x faster than the original version. The optimized kernel brought us another ~7x speed-up.
The final speed-up is a factor of ~21x on our machine. Be prepared to see different results for other hardware, for other data sizes, other instructions and other precisions! For example, when using double precision floating point elements in the example above the overall acceleration on our machine was found to be ~16 times 'only', accounting for shorter SIMD vector length and higher memory demand for double precision values.
ILNumerics Accelerator applies a multitude of optimizations to achieve this speed-up: it merges all three instructions (sum, arange, ones) into a single segment, removes constant arrays, removes intermediate results, builds low-level kernels, efficiently utilizing SIMD vector instructions, and executes many invocations of such kernels in parallel.
Note further that any manual synchronization (like the Finish() instruction in our example) introduces unwanted waiting times during execution which would not be there if synchronization were left to the Accelerator. So, in practice, measuring performance can actually slow down your algorithm significantly!
As a side note: the actual results, i.e. the element values compared to the non-optimized version, were of course correct even before adding the call to A2.Finish()! We had merely skipped the waiting for measurement purposes. Engineers in science and industry will have to reconsider how much value there is in measuring individual execution branches and individual array operations at too low a level of granularity. From now on, such array operations execute in parallel; hence, we must focus on larger computational parts instead. In the end, only the time to the final result is what counts.
Robust Cache Awareness
The measured speed-up can vary not only with your hardware. Let's modify the test code: this time we sum the elements along the rows instead of along the columns. To demonstrate the Accelerator's ability to handle constant parameters we introduce a scalar constant variable, d = 1, and pass it as a parameter to the sum function(s). Here, it is important to declare the variable as const int: axis definitions must be compile-time constants, or the Accelerator will leave the instruction without optimization. Alternatively, of course, providing a constant numeric literal ('dim: 1') works, too.
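The modification might look like this (again a hedged sketch; the expression body is a placeholder carried over from the earlier examples):

```csharp
// 'const' is essential here: a non-const axis argument would prevent
// the Accelerator from optimizing the sum() instruction.
const int d = 1; // sum along the rows instead of the columns

//ILN(enabled = false)
Array<int> A1 = sum(arange(1, 10_000) + ones<int>(10_000, 10_000), d);
//ILN(enabled = true)
Array<int> A2 = sum(arange(1, 10_000) + ones<int>(10_000, 10_000), d);
```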
Running this snippet produced the following results:
This time the speed-up is ~60x. Note that the execution of the non-optimized part took more than three times as long! This is caused by the strong dependency of many common implementations' execution speed on the storage order of the data processed. Its detrimental effect, however, has been optimized away by the Accelerator: its sum() implementation runs efficiently in any direction. Thus, the measured execution times did not change significantly compared to the former example, where we summed along the columns.
Note further that the Accelerator adjusts the segment to the actual runtime situation. Therefore, multiple invocations are often required to reach optimal execution speed. In the timings above, each of the first three invocations led to a significant improvement in execution speed. Afterwards, the optimal execution strategy was found.
Arrays as Input Data
So far, the input to sum() has been a binary expression on generator functions (arange(), ones()). Such functions are considered complex, non-scalar constants, which ILNumerics Accelerator is often able to optimize away. We'll finish this article with another example, this time receiving actual n-dimensional array data as input. It applies multiple binary bit manipulation functions to the data and sums the result:
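A sketch of what such an example could look like. The element-wise bit operators (&, |, ^) and the dummy input data below stand in for the binary bit manipulation functions and the actual data of the original listing; both are assumptions, not the article's code.

```csharp
// Placeholder input: in the original, actual n-dimensional uint data is used.
Array<uint> data = ones<uint>(1_000, 1_000);

//ILN(enabled = false)
// baseline: element-wise bit manipulation, then a reduction via sum()
Array<uint> A1 = sum((data & 0xF0F0F0F0u) | (data ^ 0x0F0F0F0Fu), 0);
//ILN(enabled = true)
// the same expression, accelerated
Array<uint> A2 = sum((data & 0xF0F0F0F0u) | (data ^ 0x0F0F0F0Fu), 0);
```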
Again, execution times are measured for the first, non-optimized expression and the optimized version, subsequently. Running this snippet on our test machine in Release mode and without debugger attached yields:
We see a speed-up of >10x. It is supported – among other factors – by the small element data type (uint), which gives room for wider parallel SIMD vector execution via AVX (8 elements per instruction on our CPU). Further – and equally important – automatic parallelization across multiple CPU cores again plays to its strength here.
Traditional attempts to parallelize workloads like this split the input data into chunks and distribute them onto multiple cores. Calculating multiple output chunks in parallel requires a certain minimal workload to justify the additional overhead of parallelization. Not so for ILNumerics Accelerator! We don't split the input data but instead delegate whole expression invocations to suitable devices – here: CPU cores.
Profiling this snippet in Intel's VTune® produced the following image (black & orange marks added):
As can clearly be seen, without the Accelerator the many iterations over the small workloads were not parallelized at all! The workload is too small for that approach to pay off. In contrast, the optimized code parallelizes efficiently on 20 threads, leading to much better resource utilization.
Further, next to threading efficiency we can clearly identify the two stages of JIT compilation: the accelerated code starts in stage 1 with moderate optimizations, leading to a speed-up of ~5x. Once JIT compilation of the low-level kernels has finished (see the partly hidden thread at the very bottom of the image above), execution continues even faster, eventually gaining a full order of magnitude.
Conclusion
The set of optimizations performed by ILNumerics Accelerator brings true efficiency to array code execution. It not only distributes large and small workloads to parallel hardware – it does so automatically, without requiring you, the programmer, to think about execution strategies or available hardware capabilities.
Even the speed-up seen here leaves room for further improvement! The Accelerator will be improved continuously, adding even more aggressive kernel optimizations and more efficient memory management, to name only some areas. In the next articles we dive into more details of the new optimizations introduced by ILNumerics Accelerator, including array pipelining.
Related articles:
Getting Started, Part II - accelerate kmeans
ILNumerics Accelerator Configuration
Introducing ILNumerics Accelerator (faster-array-codes.html)