ILNumerics Accelerator - Getting Started, Part I
In this tutorial we get you started with ILNumerics Accelerator. We present some simple examples and speed measurements, and you will learn about important aspects of the inner workings of the Accelerator compiler.
No installation is required to work with ILNumerics Accelerator: it can be used on any licensed and activated ILNumerics developer seat. Note that ILNumerics Accelerator requires a dedicated module license to be added to your seat. A regular trial license works, too.
We will utilize Visual Studio 2022 and C# 8.0. Older versions will work but may require appropriate adjustments to the following examples.
First Accelerator Example
1. Create a new console application:
Open Visual Studio and create a new project: ConsoleApp1. Leave the defaults as suggested by Visual Studio (.NET 6.0 target, AnyCPU console app - but most other configurations will do, too).
2. Import ILNumerics.Computing:
Right-click on the ConsoleApp1 node in Solution Explorer and select "Manage NuGet Packages" to open the NuGet package manager. Find and install the ILNumerics.Computing package into the project. Note: if you are working with a prerelease of ILNumerics Accelerator, make sure to check the "Include prerelease" checkbox! If you agree to the license conditions, confirm the dialog box displayed.
3. Test the simple app:
Open Program.cs from the "Hello World" app created as a stub. Replace the content of the file with the following:
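The original listing is not included in this copy. A minimal sketch of what such a first test might look like follows; the `Array<double>` type and the `ones` generator come from the ILNumerics.Computing package, and the exact overloads used here are assumptions:

```csharp
using System;
using ILNumerics;
using static ILNumerics.ILMath;

namespace ConsoleApp1 {
    class Program {
        static void Main(string[] args) {
            // Create a small matrix and print it - just to verify that the
            // ILNumerics.Computing package is referenced correctly.
            Array<double> A = ones<double>(3, 3);
            Console.WriteLine(A.ToString());
        }
    }
}
```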
Hit F5 to start the app. The output should read something similar to this:
4. A simple accelerator expression:
Replace the content of the file with the following, more interesting array expression:
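The listing itself is missing from this copy. Based on the description that follows, the pair of expressions might look like this; `arange`, `ones` and `sum` are named later in the article, but the concrete overloads and the comparison helper are assumptions:

```csharp
using System;
using ILNumerics;
using static ILNumerics.ILMath;

namespace ConsoleApp1 {
    class Program {
        static void Main(string[] args) {
            // Reference result: 10.000 columns, each holding the numbers
            // 1 ... 10.000, summed along the columns.
            Array<double> expected = sum(arange<double>(1, 10_000) * ones<double>(10_000, 10_000));

            //ILN(enabled = true)
            // The same expression again; once the Accelerator is active,
            // this region will be replaced with an optimized version.
            Array<double> result = sum(arange<double>(1, 10_000) * ones<double>(10_000, 10_000));
            //ILN(enabled = false)

            // Placeholder comparison - the original listing may use a
            // different helper for the element-wise equality check.
            Console.WriteLine($"Results equal: {Equals(expected, result)}");
        }
    }
}
```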
Congratulations! You have just summed 10.000 columns, each containing the numbers between 1 and 10.000. Instead of displaying the result, we have computed this array twice. Note how the second expression is surrounded by two comments: //ILN(enabled = true) and //ILN(enabled = false). Their meaning will become obvious soon.
5. Bring in the Accelerator
Go back to the NuGet package manager, then find and install the ILNumerics.Accelerator.Command package. Make sure to include prerelease versions during the beta phase! If you agree to the conditions, confirm the license dialog.
6. Hit F5 to build and start the app.
The output is the same as before. What else did we expect!? Now observe the Build tab of the Visual Studio Output tool window. It displays the following message:
ILNumerics Accelerator cannot optimize your code because the project does not allow unsafe code to be used. Make sure to enable 'Unsafe code' in the project settings (Visual Studio) or to include the <AllowUnsafeBlocks> tag with the value 'true' in the *.csproj project definition file and try again!
ILNumerics Accelerator attempts to replace array expressions from your code file with optimized versions. The original code files will not be altered but the optimized code will be compiled into the output assembly. The optimized version makes heavy use of pointers, hence it requires your project to enable the feature: 'unsafe code'. Since a new project does not have this feature enabled by default, the accelerator compiler did not replace any code yet.
7. Enable 'unsafe code':
Open the project options by right-clicking on the project node in Visual Studio's Solution Explorer. In the Build -> General tab, enable the "Unsafe Code" checkbox:
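Alternatively, add the tag mentioned in the build message directly to the project file. The surrounding PropertyGroup is whatever your ConsoleApp1.csproj already contains:

```xml
<PropertyGroup>
  <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
</PropertyGroup>
```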
8. Select Release Configuration
The Accelerator is all about performance. Its output is not intended for (and in fact cannot support) debugging. Thus, the Accelerator requires your project to be compiled in Release mode!
Note: in the prerelease version the Accelerator works in Debug mode, too. However, debugging accelerator generated code is quite challenging and not recommended!
9. Hit F5 to build and start the app.
Again, we get the same output: "Results equal: true". The warning message in the build output tool window is gone. This time, the Accelerator actually ran. But how can we tell...?
Verifying Accelerated Code
Two aspects are of utmost importance when it comes to code acceleration:
- The result must be the same.
- The result must be calculated faster.
Note: the first aspect really deserves its own article. In short: as with all parallel execution, results may not be exactly the same when computed on different paths, in a different order, or on different devices. This is especially true for floating point calculations. For high-level optimizations it is also true in a more general sense: depending on the fastest execution strategy and/or the fastest device, segments must be implemented (compiled to low-level device codes) in very different ways! Instead of comparing results for exact (binary) identity, you would check that the absolute difference between the results lies within the range of the expected error, given the algorithm used. For floating point operations this error commonly corresponds to the round-off error accumulated over the instructions involved. No-one said it would be easy ...
In order to verify correctness and to measure speed-up, we need a way to specify which expressions are accelerated and which are left alone. Here, the code comments inserted earlier come into play. The Accelerator works like a state machine: right before the second expression, the comment //ILN(enabled = true) enables the Accelerator. It keeps the enabled state until either the end of the file is reached or another code comment disabling the Accelerator is found.
Since we have marked the second expression only, we can use the result of the first to compare against the result computed by the accelerated expression. (In this case we - exceptionally - compare for exact identity.)
Now, let's focus on the second aspect: speed-up. Throughout this article we will use a simple scheme: expressions to be measured are calculated repeatedly; after 10 inner repetitions the average measured execution time is displayed, and the whole measurement is repeated 5 times. After applying this scheme to the original code we go on, measuring the accelerated code:
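The measurement scheme described above can be sketched as follows, using System.Diagnostics.Stopwatch; the commented-out line stands in for the expression actually being measured:

```csharp
using System;
using System.Diagnostics;

class MeasureSketch {
    static void Main() {
        var watch = new Stopwatch();
        // 5 outer measurements ...
        for (int m = 0; m < 5; m++) {
            watch.Restart();
            // ... of 10 inner repetitions each.
            for (int i = 0; i < 10; i++) {
                // Place the expression to be measured here, e.g.:
                // Array<double> result = sum(arange<double>(1, 10_000) * ones<double>(10_000, 10_000));
            }
            // Display the average execution time over the 10 repetitions.
            Console.WriteLine($"avg: {watch.ElapsedMilliseconds / 10.0} ms");
        }
    }
}
```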
Compile & run this in Release mode and without a debugger attached (CTRL + F5)! On our test computer this gave the following result:
So, while the first expression required 450 ms on average, the second expression - the one we have accelerated - required 0 ms on average! This sounds like quite some speed-up, doesn't it?!
But wait! Measuring performance has always been a challenge - even in the sequential world. Now that we live in a parallel world, things become ... well, even more interesting! In general, whenever you hear someone claiming speed-up factors of more than a magnitude or two for some technology - be careful and doubtful! It likely deserves a closer look.
A Closer Look
ILNumerics Accelerator turns the marked expression from our example into a single segment. It not only removes all temporaries, makes an educated guess about the 'best' execution unit to process the segment's instructions on, compiles the segment for the selected device, and optimizes it for the concrete data and hardware properties - it also enqueues the segment to the device, asynchronously.
The main thread immediately continues preparing subsequent segments. So, in our example, the main thread processes and enqueues all 10 inner iterations before meeting the watch.ElapsedMilliseconds expression in line 29. At this time, however, the segments' work is not done yet! The segments were enqueued and are now in an undefined state: some may have finished processing already, some are still waiting, some may currently be executing. We don't know!
ILNumerics manages synchronization of segments for you. One especially important synchronization point is when access to the values of an array is required. In our example the first access happens in line 33, when comparing the results. Here, execution is halted until the processing is finished.
In order to make the measurement more fair we insert a manual synchronization instruction right before line 29:
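Sketched in full, the modified measurement loop might look like this; the ILNumerics overloads and the `empty<double>()` initializer are assumptions, while `Finish()` is the synchronization member described below:

```csharp
using System;
using System.Diagnostics;
using ILNumerics;
using static ILNumerics.ILMath;

class FinishSketch {
    static void Main() {
        var watch = new Stopwatch();
        for (int m = 0; m < 5; m++) {
            watch.Restart();
            Array<double> result = empty<double>();
            for (int i = 0; i < 10; i++) {
                //ILN(enabled = true)
                result = sum(arange<double>(1, 10_000) * ones<double>(10_000, 10_000));
                //ILN(enabled = false)
            }
            // Manual synchronization: wait until all pending computations
            // writing to 'result' have completed before reading the clock.
            result.Finish();
            Console.WriteLine($"avg: {watch.ElapsedMilliseconds / 10.0} ms");
        }
    }
}
```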
The Finish() member function waits until all pending computations writing to the array are completed. Running this code again (in Release mode, without a debugger attached) now gave the following results:
The speed-up is a factor of ~15x, on our machine and on a single thread! Be prepared to see very different results on your machine - it all depends on your hardware. Note that any manual synchronization likely introduces waiting times during execution which are not there when synchronization is left to the Accelerator. So, in practice, measuring performance may actually slightly slow down your algorithm!
As a side note: the actual results, i.e. the element values compared to the non-optimized version, were of course correct! We had just skipped the waiting for measurement purposes. Engineers in science and industry will have to reconsider how much value lies in measuring individual execution branches and individual segments / array operations. Since these execute in parallel, we will have to focus on larger computational parts instead! In the end, the value is brought to your real algorithm within your real application. The main goal is not to speed up small parts thereof, but to get your final results faster!
The measured speed-up can vary not only with your hardware. Let's modify the test code: this time we sum the elements along the rows instead of along the columns. To demonstrate the Accelerator's ability to handle constant parameters we introduce a scalar constant variable, d = 1, and pass it as a parameter to the sum function(s). Here, it is important to declare the variable as const int: axis definitions must be compile-time constants, or the Accelerator would leave the instruction without optimization. Alternatively, of course, providing a constant numeric literal ('dim: 1') works, too.
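A sketch of the modified test; note the const modifier on the axis variable, which is what lets the Accelerator treat it as a compile-time constant (the exact overloads and the comparison helper are assumptions):

```csharp
using System;
using ILNumerics;
using static ILNumerics.ILMath;

class RowSumSketch {
    static void Main() {
        // The axis must be a compile-time constant; a non-const variable
        // would leave the instruction without optimization.
        const int d = 1;

        // Non-accelerated reference: sum along the rows.
        Array<double> expected = sum(arange<double>(1, 10_000) * ones<double>(10_000, 10_000), d);

        //ILN(enabled = true)
        Array<double> result = sum(arange<double>(1, 10_000) * ones<double>(10_000, 10_000), d);
        //ILN(enabled = false)

        // Placeholder comparison - see the earlier examples.
        Console.WriteLine($"Results equal: {Equals(expected, result)}");
    }
}
```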
Running this snippet produced the following results:
This time the speed-up is ~40x. Note that the execution of the non-optimized part took more than twice as long! This is caused by the strong dependency of many common implementations' execution speed on the storage order of the data processed. This dependency, however, does not exist for the optimized code: its sum() implementation runs efficiently in any direction. It did not change significantly compared to the former example, where we summed along the columns.
Another interesting detail is seen in the last measured times: one repetition completed within 13 ms. So, the implementation and the hardware have some potential for further speed-up. This can also be seen when applying the following configuration in your code, just before the expression to be measured (let's say: line 20):
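The configuration is the same one named in the last example of this article:

```csharp
// Allow the Accelerator to generate additional, specialized kernel
// versions. Place this just before the expression to be measured.
ILNumerics.Segment.Default.SpecializeFlags =
        ILNumerics.Core.Segments.SpecializeFlags.BSDsAll;
```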
Positive Side Effects by Acceleration
Some high-level optimizations applied by ILNumerics Accelerator can have a positive effect on the stability and compatibility of your algorithms, as the next example shows: by increasing the problem size to [1.000.000 x 10.000] we hit the limits of the available memory on our test computer. As expected, an OutOfMemoryException is triggered. The accelerated code, however, consumes much less memory. Hence, it completes without error:
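The enlarged problem might be sketched like this (a [1.000.000 x 10.000] double matrix alone occupies roughly 80 GB; whether the non-accelerated version throws depends on your machine, and the overloads are assumptions as before):

```csharp
using System;
using ILNumerics;
using static ILNumerics.ILMath;

class OomSketch {
    static void Main() {
        try {
            // Without acceleration, the temporaries of this expression
            // exceeded the available memory on our test computer.
            Array<double> result = sum(arange<double>(1, 1_000_000) * ones<double>(1_000_000, 10_000));
            Console.WriteLine("completed without error");
        } catch (OutOfMemoryException) {
            Console.WriteLine("OutOfMemoryException!");
        }
    }
}
```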
Arrays as Input Data
So far, the input to sum() has been a binary expression on generator functions (arange(), ones()). Such functions are considered complex, non-scalar constants; ILNumerics Accelerator is often able to optimize them away. We'll finish this article with another example, this time receiving actual n-dimensional array data as input. It applies multiple binary bit manipulation functions to the data and sums the result:
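The listing is not included in this copy. Schematically - with placeholder comments standing in for the actual bit manipulation functions, whose names are not given here - the example has this shape:

```csharp
using System;
using ILNumerics;
using static ILNumerics.ILMath;

class BitSumSketch {
    static void Main() {
        // 'A' stands for the actual n-dimensional uint input data of the
        // original example; its creation is a placeholder here.
        Array<uint> A = ones<uint>(1000, 1000, 10);

        // Non-accelerated reference version.
        Array<uint> expected = sum(/* bit manipulation functions applied to A */ A);

        //ILN(enabled = true)
        Array<uint> result = sum(/* the same bit manipulation functions */ A);
        //ILN(enabled = false)

        // Placeholder comparison - see the earlier examples.
        Console.WriteLine($"Results equal: {Equals(expected, result)}");
    }
}
```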
Execution times are measured, again, of the first (non-optimized) expression and the optimized version afterwards. Running this snippet on our test machine in Release mode and without debugger attached gave:
We see a speed-up of a good order of magnitude (~16x). It is - among others - supported by the small element data type (uint), which gives room for wider parallel SIMD vector execution via AVX (8 elements on our CPU). Further - and equally important - automatic parallelization over multiple CPU cores again plays out its advantage here.
Traditional attempts to parallelize workloads like this split the input data into chunks and distribute them onto multiple cores. The attempt to calculate multiple output chunks in parallel requires a certain minimal workload to hide latencies and other overhead of parallelization. Not so for ILNumerics Accelerator! We don't split the input data but instead distribute whole expression invocations over suitable devices - here: CPU cores.
Profiling this snippet in VTune produced the following image (black marks added):
As can clearly be seen, without the Accelerator the many iterations over the small workloads were not parallelized at all! In contrast, the optimized code parallelizes nicely on 8 threads, leading to much better resource utilization.
As before, even this great speed-up leaves room for improvement. Applying the same specializing kernel optimization as before (ILNumerics.Segment.Default.SpecializeFlags = ILNumerics.Core.Segments.SpecializeFlags.BSDsAll) leads to a further decrease in execution times:
Next: Introducing ILNumerics Accelerator (faster-array-codes.html)