ILNumerics Accelerator – Faster Array Codes
Getting Started, Part II – A Practical Example
TLDR: A new, long-awaited approach automatically executes arbitrary numerical array algorithms efficiently on regular (heterogeneous) hardware. It requires no user guidance. It scales with the hardware, without recompilation. It runs on any platform supporting .NET. For us, it is the result of a decade of research. For the world, it might become a real solution to the long-standing parallel programming problem. See it working for the first time!
In part one we demonstrated how the new Accelerator compiler speeds up some simple array expressions. Now, let's get our hands dirty with a more practical example. Watch it put your hardware on steroids!
An old Acquaintance: K-Means
Recently, I found a sponsored article on CodeProject. It used the Intel® Extension for Scikit-learn to speed up the popular kmeans algorithm in Python's scikit-learn machine learning library. A good number of ILNumerics customers use Python for authoring algorithms before bringing them onto their devices with the ILNumerics Computing Engine. Therefore, numpy is well known to us. In fact, ILNumerics is fully compatible with all numpy features – including all subarray and broadcasting functionality! A while ago we compared the speed of kmeans on ILNumerics with Python, Fortran, and MATLAB. ILNumerics won. However, at that time it was far from exhausting all capabilities! We will stick to kmeans for now, because it requires significant computation, does not call into third-party libraries (like LAPACK), and is not too simple to parallelize, yet simple enough to understand what is going on.
While Intel®'s Extension for Scikit-learn is the result of manual, hand-tuned optimizations, we aim at optimizing arbitrary, user-written algorithms based on n-dimensional arrays. Therefore, we start from a simple implementation of kmeans, using well-known array functions from ILNumerics.Computing. Later, we will have the ILNumerics Accelerator compiler turn it into optimized code.
The baseline code:
Let's quickly step through the code: you can see that only the regular ILNumerics function rules have been applied. The kmeans function receives an input array and an output array. It returns a RetArray&lt;double&gt; holding the newly computed centers. It also defines a scalar output parameter 'iterations', which holds the number of iterations performed by kmeans when the function returns.
Within the function we initialize the centers by taking samples from the input array A in line 6. An array of class assignments will store the index of the assigned cluster for each sample in A. For performance reasons it is always a good idea to pre-allocate array data with the correct, required size; this is done in line 7. After a few more helper variable declarations the main loop begins.
Each iteration of kmeans comprises two steps: first, for each sample in A the nearest cluster is determined, based on the minimal distance to each cluster center. Second, the cluster centers are recomputed, taking the newly assigned samples into account. These steps are repeated until the class assignments and center positions no longer change.
Most of the workload of the first step is processed in line 20. This line computes the L2 distance from the current sample $A_i$ to each cluster center. It is repeated for each sample (column) of A. There is nothing special here, except that we call Dispose() on the result returned from ILMath.min(...), because it is neither used nor assigned and would otherwise trigger the GC at some point. This is not obligatory, and if you forget to clean up here it may hurt performance a little - but nothing worse will happen. All we need is the index, which is returned as OutArray&lt;long&gt;.
The second part starts at line 26 and is also straightforward. For each of the existing centers we search for all samples assigned to that center and calculate the mean of these samples to obtain the new cluster center. Both parts are repeated until the class assignments no longer change.
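For readers coming from scikit-learn: the two steps described above map directly onto common array functions. The original listing is C# code built on ILNumerics.Computing; as a hypothetical illustration only (the function name and structure below are our own, not the author's code), the same logic looks like this in numpy terms:

```python
import numpy as np

def kmeans_sketch(A, k, max_iter=100):
    """A: (d, n) array holding n samples of dimension d. Returns centers, classes, iterations."""
    n = A.shape[1]
    centers = A[:, :k].copy()              # initialize centers from samples of A
    classes = np.zeros(n, dtype=np.int64)  # pre-allocated class assignments
    for it in range(1, max_iter + 1):
        # part 1: assign each sample to the nearest center (squared L2 distance)
        dists = np.sum((A[:, None, :] - centers[:, :, None]) ** 2, axis=0)  # (k, n)
        new_classes = np.argmin(dists, axis=0)
        if it > 1 and np.array_equal(new_classes, classes):
            break                          # converged: assignments unchanged
        classes = new_classes
        # part 2: recompute each center as the mean of its assigned samples
        for c in range(k):
            members = A[:, classes == c]
            if members.shape[1] > 0:
                centers[:, c] = members.mean(axis=1)
    return centers, classes, it
```

The ILNumerics version follows the same structure, but expressed with ILMath array functions and the function rules discussed above.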
First Run: the Well-Known Speed of ILNumerics
Disclaimer: if you are working on the edge of performance you know that performance can be a beast! How fast an algorithm really runs depends on many factors. At a minimum, always make sure to compile your program in Release mode and run it without a debugger attached! Otherwise, your times will increase significantly - by several orders of magnitude in the worst case! Further, leave the first iteration out of your measurement: it is dedicated to the JIT compiler. This is even more true for ILNumerics Accelerator, which performs a lot of compiler optimizations at runtime.
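The warm-up rule applies regardless of language. As a hypothetical illustration (the helper name and parameters below are our own, not part of the article's test code), a minimal timing harness that discards the first runs might look like this:

```python
import time

def measure(fn, *args, warmup=1, runs=5):
    """Time fn(*args), discarding the first `warmup` runs (JIT / cache warm-up).
    Returns (best, average) of the measured runs, in seconds."""
    times = []
    for i in range(warmup + runs):
        t0 = time.perf_counter()
        fn(*args)
        elapsed = time.perf_counter() - t0
        if i >= warmup:          # keep only the runs after warm-up
            times.append(elapsed)
    return min(times), sum(times) / len(times)
```

Reporting the best time in addition to the average helps separate real algorithmic cost from scheduling noise.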
Test data and measurement runs: for testing the kmeans function we use the following code. Note how all of this references the ILNumerics.Computing package only.
In line 1 we initialize A from the test data shipped in ILSpecialData.terrain. Since we are mostly interested in performance, we modify A so that it produces stable clusters after 5 iterations - very useful for debugging! This setup tests the kmeans function with the following parameters:
| A.S (data dimensions) | A.S (number of samples) | k (number of clusters) |
| --- | --- | --- |
As you can see from the code, we perform two test runs, providing the same parameters for each run. Both calls to kmeans() run the code as displayed above. The output looks similar to this:
The result was computed in 5 iterations. For each iteration of kmeans we display the time in ms required for the first part, for the second part, and the sum of both values. Note how the output corresponding to the first kmeans() call is marked "(cold)", since it includes some overhead for the CLR's JIT. In this case, however, the measured fluctuations in execution time hide this small overhead. On average, each iteration within kmeans took ~400 ms.
Optimize! Bring in the Accelerator!
Alright. As a performance expert you know the usual next steps: go out and investigate potential bottlenecks with extensive profiling runs over a variation of common input parameters. Identify code that can be parallelized. Design a strategy to bring this code and the corresponding data efficiently to all computing resources. Refine your design to make it more efficient. Save memory! Profile again! Start all over until you run out of budget... No, just kidding. All this work is now obsolete: ILNumerics Accelerator performs these optimizations for you - and many more!
You may have noticed that the kmeans code already marks the main iteration body for optimization with ILNumerics Accelerator. These marks appear as comments, hence are no-ops for the C# compiler. Now, add the ILNumerics.Accelerator.Command nuget package to your project:
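Assuming the standard .NET CLI workflow (your IDE's NuGet package manager works just as well), adding the package looks like:

```shell
dotnet add package ILNumerics.Accelerator.Command
```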
We reuse the same test code, change the Console info text from "Non-optimized" to "Optimized" and run the measurement test again:
Timings are now better by roughly an order of magnitude.
For the first column (the first part of each kmeans iteration) the speedup is even higher than 10x! The times for the second part increased slightly, though. This is due to the inner workings of ILNumerics Accelerator and deserves a longer explanation, which is outside the scope of this article. For now, the takeaway is: a strong overall speed-up was gained. Let's see where it comes from!
A Look inside
The advantage in speed becomes obvious when looking at the utilization of our physical cores during execution. The following image comes from Intel's VTune analyzer. We've marked and labeled the interesting regions:
This time, both measurement tests were run in the same program. After the startup phase, the first and second calls to the simple, non-optimized kmeans() version execute on the main thread only. This is because traditional methods of parallelization are not able to efficiently distribute the workload of individual array functions. Afterwards, the optimized version of the same kmeans code is run (marked yellow). Here we clearly see that it executes on the main thread and on all 4 worker threads in parallel. Looking at the peaks in CPU utilization, all 2 x 5 iterations performed by kmeans() are clearly visible. They all executed in parallel!
Efficient Use of All Resources
When we compare the speed of the optimized version with the original version (~10 times faster), one might notice a gap: even if we somehow managed to optimally utilize 5 cores (compared to one), that alone could not explain a 10x speed-up!?
ILNumerics Accelerator's goal is to utilize all computing resources efficiently! This includes the SIMD vector extensions of a single core. It includes GPUs and other accelerating devices found on the computer at runtime. And of course it also implies efficient use of memory! Here, the overall speed-up is achieved by addressing the CPU's vector extensions on multiple cores, by limiting the memory required, by merging subsequent array expressions, and by fitting the optimal execution strategy to the actual runtime situation.
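Speed-ups beyond the core count typically come from the kernel itself: a fused, vectorized kernel streams over memory once, while an element-wise scalar loop issues many small operations. As a language-neutral illustration only (this is not the Accelerator's actual code path; both functions below are our own), the two variants compute the same distances, and the fused form is what SIMD units and expression merging exploit:

```python
import numpy as np

def distances_loop(A, c):
    """Naive scalar loop: one element at a time, no SIMD, no fusion."""
    d, n = A.shape
    out = np.empty(n)
    for i in range(n):
        s = 0.0
        for j in range(d):
            diff = A[j, i] - c[j]
            s += diff * diff
        out[i] = s
    return out

def distances_fused(A, c):
    """One fused, vectorized expression: a single pass over the data."""
    return np.sum((A - c[:, None]) ** 2, axis=0)
```

Both return identical results; only the execution strategy differs. Combining such vectorized kernels across multiple cores is how gains beyond the plain core count become possible.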
Limitations in the Beta
The new and patented method of ILNumerics Accelerator enables efficient parallelization, even for small and mid-sized arrays. It automatically detects the parallel potential within your code and addresses all hardware resources found at runtime. Note that even the speed increase in the example above does not mark the limit yet! There is a ton of further optimization potential which we will make available over time.
The current beta supports optimizations for the CPU: vector extensions, memory efficiency, and automatic distribution to multiple cores. Automatic distribution to GPUs and other OpenCL devices will be supported, too. Again: ILNumerics Accelerator does not require any user guidance! It makes all decisions for you.
You do the math - we do the rest!
Patents & Trademarks
ILNumerics software is protected by international patents and patent applications. WO2018197695A1, US11144348B2, JP2020518881A, EP3443458A1, DE102017109239A1, CN110383247A, DE102011119404.9, EP22156804. ILNumerics is a registered trademark of ILNumerics GmbH, Berlin, Germany.