ILNumerics Accelerator – Faster Array Codes
Getting Started, Part II – A Practical Example
TLDR: A new, long-awaited approach automatically executes arbitrary numerical array algorithms efficiently on regular (heterogeneous) hardware. It requires no user guidance. It scales with the hardware, without recompilation. It runs on any platform supporting .NET. For us, it is the result of a decade of research. For the world, it might become a real solution to the long-standing parallel programming problem. See it working for the first time!
In part one we demonstrated how the new Accelerator compiler speeds up some simple array expressions. Now, let's get our hands dirty with a more practical example. Watch it put your hardware on steroids!
An old Acquaintance: K-Means
Recently, I found a sponsored article on CodeProject. It used the Intel® Extension for Scikit-learn to speed up the popular kmeans algorithm in Python's scikit-learn machine learning library. A good number of ILNumerics customers use Python for authoring algorithms before bringing them onto their devices with the ILNumerics Computing Engine. Therefore, numpy is well known to us. In fact, ILNumerics is fully compatible with all numpy features – including all subarray and broadcasting functionality! A while ago we compared the speed of kmeans on ILNumerics with Python, Fortran, and MATLAB. ILNumerics won. However, at that time it was far from exhausting all capabilities! We will stick to kmeans for now, because it requires significant computation, does not call into third-party libraries (like LAPACK), and is not too simple to parallelize, yet simple enough to understand what is going on.
While Intel®'s Extension for Scikit-learn is the result of manual, hand-tuned optimizations, we aim at optimizing arbitrary, user-written algorithms based on n-dimensional arrays. Therefore, we start from a simple implementation of kmeans, using well-known array functions from ILNumerics.Computing. Later, we will have the ILNumerics Accelerator compiler turn it into optimized code.
The baseline code:
Let's quickly step through the code: you can see that only the regular ILNumerics function rules have been applied. The kmeans function receives an input array and an output array. It returns a RetArray&lt;double&gt; holding the newly computed centers. It also defines a scalar output parameter 'iterations', which holds the number of iterations performed by kmeans when the function returns.
Within the function we initialize the centers by taking samples from the input array A in line 6. An array of class assignments will store the index of the assigned cluster for each sample in A. For performance reasons it is always a good idea to pre-allocate array data with the correct, required size; this is done in line 7. After a few more helper variable declarations the main loop begins.
Each iteration of kmeans comprises two steps: first, for each sample in A the nearest cluster is determined, based on the minimal distance to each cluster center. Second, the cluster centers are recomputed, taking the newly assigned samples into account. These steps are repeated until the class assignments and center positions no longer change.
Most of the workload of the first step is processed in line 20. This line computes the L2 distance from the current sample $A_i$ to each cluster center. It is repeated for each sample (column) of A. There is nothing special here, except that we call Dispose() on the result returned from ILMath.min(...), because it is neither used nor assigned and would otherwise trigger the GC at some point. This is not obligatory, and if you forget to clean up here it may hurt performance a little - but nothing worse will happen. All we need is the index, which is returned as OutArray&lt;long&gt;.
The second part starts at line 26 and is also straightforward. For each of the existing centers we search for all samples assigned to that center and calculate the mean of these samples to obtain the new cluster center. Both parts are repeated until the class assignments no longer change.
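For readers coming from scikit-learn: the two steps described above map directly onto common array functions. The original listing is C# code built on ILNumerics.Computing; as a hypothetical illustration only (the function name and structure below are our own, not the author's code), the same logic looks like this in numpy terms:

```python
import numpy as np

def kmeans_sketch(A, k, max_iter=100):
    """A: (d, n) array holding n samples of dimension d. Returns centers, classes, iterations."""
    n = A.shape[1]
    centers = A[:, :k].copy()              # initialize centers from samples of A
    classes = np.zeros(n, dtype=np.int64)  # pre-allocated class assignments
    for it in range(1, max_iter + 1):
        # part 1: assign each sample to the nearest center (squared L2 distance)
        dists = np.sum((A[:, None, :] - centers[:, :, None]) ** 2, axis=0)  # (k, n)
        new_classes = np.argmin(dists, axis=0)
        if it > 1 and np.array_equal(new_classes, classes):
            break                          # converged: assignments unchanged
        classes = new_classes
        # part 2: recompute each center as the mean of its assigned samples
        for c in range(k):
            members = A[:, classes == c]
            if members.shape[1] > 0:
                centers[:, c] = members.mean(axis=1)
    return centers, classes, it
```

The ILNumerics version follows the same structure, but expressed with ILMath array functions and the function rules discussed above.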
First Run: the Well-Known Speed of ILNumerics
Disclaimer: if you are working on the edge of performance you know that performance can be a beast! How fast an algorithm really runs depends on many factors. At a minimum, always make sure to compile your program in Release mode and run it without a debugger attached! Otherwise, your times will increase significantly - by several orders of magnitude in the worst case! Further, leave the first iteration out of your measurement: it is dedicated to the JIT compiler. This is even more true for ILNumerics Accelerator, which performs a lot of compiler optimizations at runtime.
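The warm-up rule applies regardless of language. As a hypothetical illustration (the helper name and parameters below are our own, not part of the article's test code), a minimal timing harness that discards the first runs might look like this:

```python
import time

def measure(fn, *args, warmup=1, runs=5):
    """Time fn(*args), discarding the first `warmup` runs (JIT / cache warm-up).
    Returns (best, average) of the measured runs, in seconds."""
    times = []
    for i in range(warmup + runs):
        t0 = time.perf_counter()
        fn(*args)
        elapsed = time.perf_counter() - t0
        if i >= warmup:          # keep only the runs after warm-up
            times.append(elapsed)
    return min(times), sum(times) / len(times)
```

Reporting the best time in addition to the average helps separate real algorithmic cost from scheduling noise.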
Test data and measurement runs: for testing the kmeans function we use the following code. Note how all of this references the ILNumerics.Computing package only.
In line 1 we initialize A from the test data shipped in ILSpecialData.terrain. Since we are mostly interested in performance, we modify A so that it produces stable clusters after 5 iterations - very useful for debugging! This setup tests the kmeans function with the following parameters:
| A.S (data dimensions) | A.S (number of samples) | k (number of clusters) |
| --- | --- | --- |
As you can see from the code, we perform two test runs, providing the same parameters for each run. Both calls to kmeans() run the code as displayed above. The output looks similar to this:
The result was computed in 5 iterations. For each iteration of kmeans we display the time in ms required for the first part, for the second part, and the sum of both values. Note how the output corresponding to the first kmeans() call is marked "(cold)", since it includes some overhead for the CLR's JIT. In this case, however, the measured fluctuations in execution time hide this small overhead. On average, each iteration within kmeans took ~400 ms.
Optimize! Bring in the Accelerator!
Alright. As a performance expert you know the usual next steps: go out and investigate potential bottlenecks with extensive profiling runs over a variation of common input parameters. Identify code that can be parallelized. Design a strategy to bring this code and the corresponding data efficiently to all computing resources. Refine your design to make it more efficient. Save memory! Profile again! Start all over until you run out of budget... No, just kidding. All this work is now obsolete: ILNumerics Accelerator performs these optimizations for you - and many more!
You may have noticed that the kmeans code already marks the main iteration body for optimization with ILNumerics Accelerator. These marks appear as comments, hence are no-ops for the C# compiler. Now, add the ILNumerics.Accelerator.Command nuget package to your project:
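Assuming the standard .NET CLI workflow (your IDE's NuGet package manager works just as well), adding the package looks like:

```shell
dotnet add package ILNumerics.Accelerator.Command
```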
We reuse the same test code, change the Console info text from "Non-optimized" to "Optimized" and run the measurement test again:
Timings are now better by roughly an order of magnitude.
For the first column (the first part of each kmeans iteration) the speedup is even higher than 10x! The times for the second part increased slightly, though. This is due to the inner workings of ILNumerics Accelerator and deserves a longer explanation, which is outside the scope of this article. For now, the takeaway is: a strong overall speed-up was gained. Let's see where it comes from!
A Look inside
The advantage in speed becomes obvious when looking at the utilization of our physical cores during execution. The following image comes from Intel's VTune analyzer. We've marked and labeled the interesting regions:
This time, both measurement tests were run in the same program. After the startup phase, the first and second calls to the simple, non-optimized kmeans() version execute on the main thread only. This is because traditional methods of parallelization are not able to efficiently distribute the workload of individual array functions. Afterwards, the optimized version of the same kmeans code is run (marked yellow). Here we clearly see that it executes on the main thread and on all 4 worker threads in parallel. Looking at the peaks in CPU utilization, all 2 x 5 iterations performed by kmeans() are clearly visible. They all executed in parallel!
Efficient Use of All Resources
When we compare the speed of the optimized version with the original version (~10 times faster), one might notice a gap: even if we somehow managed to optimally utilize 5 cores (compared to one), that alone could not explain a 10x speed-up!?
ILNumerics Accelerator's goal is to utilize all computing resources efficiently! This includes the SIMD vector extensions of a single core. It includes GPUs and other accelerating devices found on the computer at runtime. And of course it also implies efficient use of memory! Here, the overall speed-up is achieved by addressing the CPU's vector extensions on multiple cores, by limiting the memory required, by merging subsequent array expressions, and by fitting the optimal execution strategy to the actual runtime situation.
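Speed-ups beyond the core count typically come from the kernel itself: a fused, vectorized kernel streams over memory once, while an element-wise scalar loop issues many small operations. As a language-neutral illustration only (this is not the Accelerator's actual code path; both functions below are our own), the two variants compute the same distances, and the fused form is what SIMD units and expression merging exploit:

```python
import numpy as np

def distances_loop(A, c):
    """Naive scalar loop: one element at a time, no SIMD, no fusion."""
    d, n = A.shape
    out = np.empty(n)
    for i in range(n):
        s = 0.0
        for j in range(d):
            diff = A[j, i] - c[j]
            s += diff * diff
        out[i] = s
    return out

def distances_fused(A, c):
    """One fused, vectorized expression: a single pass over the data."""
    return np.sum((A - c[:, None]) ** 2, axis=0)
```

Both return identical results; only the execution strategy differs. Combining such vectorized kernels across multiple cores is how gains beyond the plain core count become possible.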
Limitations in the Beta
The new and patented method of ILNumerics Accelerator enables efficient parallelization, even for small and mid-sized arrays. It automatically detects the parallel potential within your code and addresses all hardware resources found at runtime. Note that even the speed increase in the example above does not mark the limit yet! There is a ton of further optimization potential which we will make available over time.
The current beta supports optimizations for the CPU: vector extensions, memory efficiency, and automatic distribution to multiple cores. Automatic distribution to GPUs and other OpenCL devices will be supported, too. Again: ILNumerics Accelerator does not require any user guidance! It makes all decisions for you.
You do the math - we do the rest!
Patents & Trademarks
ILNumerics software is protected by international patents and patent applications. WO2018197695A1, US11144348B2, JP2020518881A, EP3443458A1, DE102017109239A1, CN110383247A, DE102011119404.9, EP22156804. ILNumerics is a registered trademark of ILNumerics GmbH, Berlin, Germany.