Tag Archives: performance

ILNumerics Accelerator – A better Approach to faster Array Codes, Part II

A better Approach to an unsolved Problem

TLDR: Authoring fast numerical array codes on .NET is a complex task. ILNumerics Computing Engine simplifies this task: it brings a convenient syntax which is fully compatible with numpy and Matlab. And today, we bring back the free lunch: our brand new JIT compiler not only transforms numerical algorithms into highly efficient code, it also automatically adapts to and parallelizes your workload on any hardware found at runtime. It segments and distributes even small workloads efficiently – more fine-grained than any manual configuration could.

Author: H. Kutschbach (ILNumerics)  Reading time: 12 min

Since its introduction in 2007, ILNumerics has led innovation for technical computing in the .NET world. We have enabled a short, expressive syntax for authors of numerical algorithms. Writing array-based algorithms with ILNumerics feels very similar (and is compatible) to well-known prototyping systems, namely Matlab, numpy and their successors.

Another focus of ILNumerics has always been performance. In fact, we started the company in 2013 with the goal of building the fastest technology on earth. Today we propose a better, faster approach to array code execution: ILNumerics Accelerator.

This is part 2 of an article series on ILNumerics Accelerator. In the first part we explained why automatic parallelization has been too large a problem for the existing compiler landscape. In the following we start by describing the problem ILNumerics Accelerator solves. Then we explain how it works.

The Problem Space

Math drives the world. High-tech companies, financial institutions, medical instruments, rocket science … The most innovative projects all have one thing in common: they are driven by numerical data and complex mathematical algorithms working on these data. Most data can be represented as n-dimensional arrays, which make it possible to implement great complexity without great effort. Consequently, array-based algorithms play a huge role in today's most innovative industries.

Over the years we have seen many different attempts to realize the advantage of numerical arrays for small and large developer teams working on .NET. Some large teams have even built their own solution. And indeed: building a math library in C# is easy to start with! You will see many low-hanging fruits. With a few lines of code you are able to create matrices and to sum them in short expressions, like ‘A + B’! Woooh!! If this is all you need, you can safely skip the rest of this article…
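To illustrate how low these fruits hang (and how little they say about performance), here is a minimal sketch of such a ‘trivial solution’ in plain C#. All names are made up for illustration only:

// A deliberately naive matrix type, the kind of quick win meant above.
// Illustrative only; this is not ILNumerics code.
public sealed class NaiveMatrix
{
    private readonly double[] _data;
    public int Rows { get; }
    public int Cols { get; }

    public NaiveMatrix(int rows, int cols)
    {
        Rows = rows;
        Cols = cols;
        _data = new double[rows * cols];
    }

    public double this[int r, int c]
    {
        get => _data[r * Cols + c];
        set => _data[r * Cols + c] = value;
    }

    // 'A + B' works after a handful of lines. But note: every operator call
    // allocates a fresh result matrix and walks all memory again.
    public static NaiveMatrix operator +(NaiveMatrix a, NaiveMatrix b)
    {
        var result = new NaiveMatrix(a.Rows, a.Cols);
        for (int i = 0; i < a._data.Length; i++)
            result._data[i] = a._data[i] + b._data[i];
        return result;
    }
}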

Often enough, managers lose their enthusiasm when, after some months, large device codes built on such a trivial ‘solution’ show execution speeds far below the requirements.

Optimal execution speed is only realized by efficient utilization of all compute resources.

What may sound simple is actually a problem that has been waiting for a solution for more than two decades. Today's hardware is mostly heterogeneous. Even inside a single CPU there are at least two levels of parallelism. Often, there is also at least one GPU around.

Until now, the only working way to utilize heterogeneous computing resources was to manually distribute manually selected parts of an algorithm and its data to manually selected devices. Many tools exist which aim to help the programming expert decide, write and fine-tune the required codes at development time. This still requires expert knowledge and intimate insight into the data flow of your algorithms. And it requires time. A lot of time! The large codes created this way need testing and maintenance.

Nevertheless, this demanding approach still cannot harness the full performance potential! Needless to say, such results are far from expressive and easy to understand. ILNumerics Accelerator is here to solve this problem for the large domain of numerical array codes.

The Big Performance Picture

Let’s widen our perspective on the performance topic! Three major factors influence execution speed:

1. The Algorithm

Basically, a program performs a transformation of data from one representation (input) into another representation (result). The transformation is described by the program's algorithm(s). This high-level, business view on software is ‘lowered’ by the programmer when she writes the concrete program code, enabling a compiler to understand the intent. In subsequent compilation steps the code is then further lowered by one or more compilers into other representations, which can be better understood by the hardware. So, the programmer and the compiler(s) share the responsibility of implementing the intended transformation in a way that it is carried out on the available hardware and produces the correct result in the shortest time possible.

2. The Hardware

An optimally fast program must recognize and utilize all available hardware resources. The exact configuration and properties of all devices, however, are typically only known at runtime. This implies that a significant amount of transformation can only be done at runtime (-> JIT compiler). One may argue that a program could be specialized for one specific set of computing resources. This is of course true, but it only shifts the issue to the next time the hardware is renewed. Further, some properties of the hardware are inherently only known at runtime. Two examples: the resource utilization rate, and the influence of other programs / threads running concurrently on the computer. If the final executable program fails to keep all parts of the hardware busy with useful instructions, it will require more time to run than necessary.

3. The Data

In general, technical data is made up of individual numbers, or ‘scalars’. Traditionally, processor instructions deal with scalar data only. In array algorithms data forms arrays of arbitrary shape, size and dimensionality. This can be seen as an extension of the traditional scalar data model, since n-dimensional arrays include scalar data as a special case.

The array extension introduces great new complexity. Instead of a single number with one type and one value, an array instruction deals with input data which may have one or many – often several million – individual values! But arrays can also be empty…

While there is often exactly one way to execute a scalar instruction, executing an array instruction efficiently requires careful planning and ordering, and the execution strategy must be selected and aligned with the (greatly varying) data and hardware properties. Again, failing to do so (at runtime) leads to sub-optimal execution times.

Workload & Segmentation

Together with the algorithm, data forms another important factor: ‘workload’. It is a measure of the number of computational steps required to perform a certain data transformation. Let's say: the number of clock cycles a CPU core requires to compute the result of the ‘sin(A)’ instruction for a matrix A of a certain size. Mapping the workload onto a concrete hardware device yields the effort or cost (in terms of minimal time or energy) required to complete the transformation on this device.
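To make ‘workload’ and ‘cost’ concrete, here is a deliberately simplified back-of-the-envelope sketch in C#. All numbers and names are hypothetical and not taken from ILNumerics:

static class CostSketch
{
    // assumed average clock cycles one CPU core spends per sin() element (hypothetical)
    const double CyclesPerSinElement = 20.0;

    // workload of sin(A): total cycles needed for all elements of A
    public static double Workload(long numElements)
        => numElements * CyclesPerSinElement;

    // cost [seconds] on a device = workload / (parallel units * clock rate),
    // ignoring memory transfers, vectorization, launch overhead, ...
    public static double Cost(double workload, int units, double hz)
        => workload / (units * hz);
}

// usage, e.g. for a 1000 x 1000 matrix:
//   double w   = CostSketch.Workload(1000L * 1000);
//   double cpu = CostSketch.Cost(w, 4, 3e9);     // 4 cores at 3 GHz
//   double gpu = CostSketch.Cost(w, 2048, 1e9);  // many slower units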

Obviously, throughout a program's execution there are many intermediate steps, transforming various data with varying properties (sizes, etc.). The overall workload – in a slightly simplified view – is the aggregation of the workloads of all intermediate steps.

To keep a program's execution time small, all intermediate steps must be carefully adapted to and executed on the fastest hardware resource(s) currently available, yielding the lowest cost. Compare this to how a programmer today manually selects a certain device for (manually selected) large, predefined parts of a program at development time!

Manual segmentation and device resource selection cannot bring optimal performance for all parts of a program.

Array Codes – done (& run) right

The considerations above should make it obvious that no static compiler is able to create optimal execution performance (leaving toy examples aside). Data and hardware properties are subject to change! So is the optimal strategy for lowering the individual parts of a user algorithm to hardware instructions! Lowest execution times will only be achieved if all parallel potential of an algorithm is identified and used to execute its instructions concurrently, on all available hardware resources, and efficiently.

But let’s skip all reasoning and jump right into the interesting part! Here is how ILNumerics Accelerator achieves exactly this goal:

  1. Array instructions within the user program are identified and merged into ‘segments’.
  2. Segment limits are optimized to hold chunks of suitable workload whose size can be quickly determined at runtime.
  3. Before executing a segment …
    1. The cost of the segment with the current data is computed for each hardware resource.
    2. The best (i.e.: the fastest) hardware resource is selected.
    3. The segment’s array instructions are optimized for the selected resource and the current data.
  4. The resulting kernel is scheduled on the selected resource for asynchronous execution and cached for later reuse.

Likely, this list requires some explanation.

[ Video: ILNumerics Segments (2022). Download: https://ilnumerics.net/media/andere/ILNumerics_Segments_2022.mp4 ]

ILNumerics Accelerator takes a new approach to efficient array code execution. Instead of attempting global analysis (the top-down approach which has failed in so many projects) we optimize the program from the bottom up. Our compiler recognizes all fundamental array instructions and expressions thereof and merges them into ‘islands’ (segments) of well-known semantics and execution cost. Each segment contains a small JIT compiler, specialized to quickly compile the segment's instructions at runtime into highly optimized low-level codes – for the .NET CLR and for any OpenCL device! This step recognizes and adapts to all hardware and data properties found at runtime. And it maintains compatibility with all .NET platforms.

Currently, we support unary, binary and reduction array instructions provided by ILNumerics Computing Engine and all expressions made thereof. No changes or annotations to existing codes are required. The ILNumerics language supports all features of the Matlab and numpy languages. It will be possible to build connectors to other languages and to custom array libraries exposing similar semantics.

During build, each array expression in the user code files is automatically replaced with a call to a specialized, auto-generated segment. In addition to hosting a JIT for the array instructions, a segment also ‘knows’ the workload of its inherent computations. When the segment is called at runtime, it uses this information to compute the cost of executing the segment on each device and to identify the device promising the earliest completion. Afterwards, it enqueues the optimized kernel to this device for computation.

It then – without waiting for a result – immediately passes control on to the next segment of the algorithm. Now, if subsequent segments do not depend on each other, they are scheduled onto different threads or devices and execute in parallel!
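To make this more tangible, here is a conceptual C# sketch of the decision a segment makes at runtime. All type and member names (Device, EstimateSeconds, EnqueueAsync, …) are invented for illustration and do not reflect the actual Accelerator API:

using System;
using System.Collections.Generic;

// Conceptual model of a device and a generated segment (hypothetical names).
sealed class Device
{
    public double GigaOpsPerSecond { get; init; }
    public double PendingSeconds { get; set; }       // work already queued on this device
    public double EstimateSeconds(double workload) => workload / (GigaOpsPerSecond * 1e9);
    public object GetOrCompileKernel(Segment s) => new object();  // stand-in for JIT + kernel cache
    public void EnqueueAsync(object kernel) { /* enqueue and return immediately */ }
}

sealed class Segment
{
    // workload model of the merged array instructions, as a function of the data size
    public Func<long, double> WorkloadOf { get; init; } = n => n;

    public void Execute(IReadOnlyList<Device> devices, long numElements)
    {
        double workload = WorkloadOf(numElements);
        Device best = devices[0];
        double bestCost = double.MaxValue;
        foreach (var dev in devices)
        {
            // expected completion time = own cost + what is already queued on the device
            double cost = dev.EstimateSeconds(workload) + dev.PendingSeconds;
            if (cost < bestCost) { bestCost = cost; best = dev; }
        }
        best.EnqueueAsync(best.GetOrCompileKernel(this));  // asynchronous: control returns at once
    }
}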

All technical details of the new method are found in the patents and patent applications listed at the end of this article.

Advantages

ILNumerics Accelerator automatically executes your algorithm in the fastest possible way:

  • During execution it finds chunks of workload which can be computed concurrently.
  • It identifies the best suited hardware device available at this time.
  • For each segment it adapts the execution strategy to the fastest device, and
  • Computes the workload of many segments in parallel.

Therefore, it scales execution times with the hardware without recompilation or manual code adjustments.

Our compiler …

  • Supports vector registers, multiple CPU cores, and all OpenCL devices.
  • Removes temporary arrays and manages memory between devices transparently.
  • Applies many new and important optimizations, on kernel and on segment level.
  • Is a fully managed solution: it targets the .NET CLR and is thus compatible with all .NET platforms.

… which brings many benefits:

  • Efficient use of all compute resources.
  • No expert effort required.
  • Shorter time to market.
  • Adjusts to new hardware.
  • Preserves maintainability.

“How much faster is it?”

This is a very popular question! A comprehensive answer deserves its own article, really.

In short: it depends on your hardware. It depends on your algorithm. And it depends on your data! The question should rather be: how much more efficiently will it utilize my hardware? For the majority of real-world algorithms the answer is: as efficiently as possible! Moreover, it does so automatically!

(If you are still longing for a number: on 2015-ish hardware we see speed-up rates between 3 and more than 300. Most of the time the optimized code runs faster by at least an order of magnitude.)

Where can I get it?

ILNumerics Accelerator has entered the public beta phase. It will be released with ILNumerics Ultimate VS version 7. The pre-release is available on nuget. Start here! General documentation on the new Accelerator is found here. It will be completed over the next weeks.

Patents

ILNumerics software is protected by international patents and patent applications. WO2018197695A1, US11144348B2, JP2020518881A, EP3443458A1, DE102017109239A1, CN110383247A, DE102011119404.9, EP22156804. ILNumerics is a registered trademark of ILNumerics GmbH, Berlin, Germany.

ILNumerics Accelerator – A better Approach to faster Array Codes, Part I

A Truth most Programmers won’t tell.

TLDR: Twenty years ago, computer manufacturers, while trying to boost CPU performance, were confronted with the hard wall imposed by physical limitations. It took a long time for everyone involved to realize that individual processors would no longer become significantly faster in the future. Until then, the old approach had made it all too easy to deal with the constantly growing demands caused by ever larger data and ever more complex programs. Haymo Kutschbach, founder and CEO of ILNumerics, explains the challenges involved in automatic parallelization on today's hardware and why traditional compilers will not satisfy the strong demand for a feasible solution. A better, working approach is presented.

Reading time: 10 minutes

For many decades, the rule applied: “Your program is too slow? Simply buy the latest hardware!” This was made possible by the availability of (general purpose) programming languages, as well as a stream of innovations in processor architectures and the associated compilers. They allowed the programs to be written largely independently of the execution hardware. The compiler adapted abstract codes to the low-level features of the processors. Everything fit together so conveniently!

And then nature threw a spanner in the works of this easy calculation. New computers continue to obey Moore's Law and become more powerful with each generation. But this new computing power is now distributed over many processors! A program can only use the new speed if it can use all processors at the same time. In addition, a significant part of the computing power of today's computers is found in heterogeneous architectures such as GPUs and other accelerator hardware.

The consequences can be observed today in probably every development department: programs are written abstractly. Efforts are made to create ‘clean code’ that can be tested and maintained. Unfortunately, it often executes too slowly! So you start to examine the codes for optimization potential. You identify individual bottlenecks, decide manually on a parallelization strategy, implement, test and measure again. Many programmers don't see this as a problem – their expert knowledge guarantees them a good income for many years to come! Since there are no alternatives, the enormous delays in time to market are accepted. Just like the fact that the next generation of devices will again require a complete rewrite of large parts of the software.

Hardware today is far more diverse and heterogeneous than it was 20 years ago. But why is that actually a problem? Why are we still not able to automatically adapt and run our programs on such hardware again?

If it’s too slow, blame the Compiler!

For one, it’s because the compiler market has traditionally been dominated by processor manufacturers. Their compilers served to make their own hardware more accessible. Even though we’ve seen a recent trend towards vendor-independent compilers, they still continue the traditional approach: compiling (“lowering”) code of a general purpose programming language (like C or Fortran) to hardware instructions that can be executed on a specific (and previously selected) processor architecture. What we actually need, however, is a compiler that translates an abstract program for a “computer” – with all its diverse computing resources.

This task, however, is much more demanding! Why? Here we come to the second reason that has so far prevented a working solution: granularity. In order to be able to use parallel resources, the software must contain additional instructions that take care of the distribution of the individual program parts. This additional overhead makes a certain minimum program size (workload) mandatory. Parallelization only makes sense if the gain in speed through parallel execution exceeds the additional management effort!
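A small C# experiment makes this relation tangible. The timings will differ from machine to machine; only the relation between overhead and useful work matters here:

using System;
using System.Diagnostics;
using System.Threading.Tasks;

class GranularityDemo
{
    static void Main()
    {
        foreach (int n in new[] { 100, 1_000_000 })
        {
            var a = new double[n];

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < n; i++) a[i] = Math.Sin(i);     // sequential
            var seq = sw.Elapsed;

            sw.Restart();
            Parallel.For(0, n, i => a[i] = Math.Sin(i));         // parallel: pays scheduling overhead
            var par = sw.Elapsed;

            // For n = 100 the partitioning/thread overhead typically dominates and the
            // parallel loop is slower; for n = 1,000,000 it usually wins clearly.
            Console.WriteLine($"n = {n,9}: sequential {seq.TotalMilliseconds:F3} ms, parallel {par.TotalMilliseconds:F3} ms");
        }
    }
}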

Here, the term “granularity” refers to the number of individual low-level instructions that a piece of code requires to calculate its result. Apparently, the granularity increases with the size or complexity of a program part, because more low-level instructions are executed in the end. So, in order to make efficient use of parallelism, it is not enough to consider individual instructions of a program. Rather, a compiler must merge larger program parts (‘grains’) and distribute them to the available computing resources. One of the most important challenges of a parallelizing compiler is to find the optimal size of these “chunks” – and thus the optimal granularity.

This brings us to the third reason: complexity. In general, programs can become arbitrarily complex. Compilers have the task of translating programs into another, useful form. But how does one translate information that can be arbitrarily complex? Traditional compilers achieve this goal by limiting their consideration to individual instructions in the program. They know the (manageable) set of possible translations for each of these instructions. The finished program is then little more than the chain of such individual translations. Nevertheless, traditional compilers are already considered to be extremely complex software!

Unfortunately, this “complexity reduction” trick is in direct contradiction to the necessary minimal granularity described above. A parallelizing compiler must face the challenge of finding an optimal translation even for program chunks that consist of multiple or many instructions. As always, low-hanging fruits exist (see e.g. tensorflow). However, an optimal solution must not restrict itself to individual, specific instructions just because their translation is particularly simple and therefore easy to implement!

And then there is a fourth aspect that also plays into the area of complexity. In addition to the optimal chunk size (granularity), it is also important to identify those grains of a program that can be executed independently of one another. The program analysis required must also take into account the best possible granularity. A compiler must therefore be able to understand large program sections, form chunks of an optimal grain size and execute them efficiently in parallel on heterogeneous hardware.

Ultimately, it is the complexity of this task that has so far prevented such a compiler from being born. In general, development departments still follow the manual approach described earlier. It requires expert knowledge and enormous effort to manually perform the necessary steps (granularity and independence analysis, parallel implementation) for one specific program. But despite the enormous amount of time, and despite the enormous maintenance issues associated, this approach only scratches the surface! It can neither address nor solve the real problem.

Are we lost?

Not yet! Let’s wrap it up:

  • Traditional compilers work at too low a level of granularity. They consider too few instructions to cover a workload suitable for efficient parallelization.
  • Previous attempts to include large/all program parts in an analysis have only been successful in a few special cases. Global analysis for general programs is an unsolved problem and, in the opinion of the author, will remain so for a long time to come.

But now we come to the topic of our headline! For a not too small class of programs, ILNumerics has succeeded in developing a compiler that fulfills all optimality requirements: the class of numeric, array-based programs. They are written by scientists, mathematicians and engineers to encode the innovations of our modern times. They form the core of big data, machine learning and artificial intelligence, and of all codes making use of them. Numerical algorithms realize innovations – moreover: they are these innovations! They make it possible to formulate great complexity without great effort. They handle huge data and, therefore, require the greatest speed.

A development department faced with the challenge of accelerating its numerical array codes can only select and use the best possible language and libraries available. Technical requirements often rule out prototyping languages like numpy and Matlab. Individual parts may be outsourced to GPUs. Other parts are parallelized using multithreading. All of this slows down the industrial development process significantly. It is like writing GUI programs in assembler…

A Better Approach to faster Array Codes

ILNumerics proposes a new, faster approach to the execution of array codes. ILNumerics Accelerator automatically finds parallel potential in array-based algorithms (as in numpy, Matlab, ILNumerics) and efficiently distributes their workload to heterogeneous computing resources – dynamically and at runtime, when all important information is available. It again scales your algorithm's speed to any heterogeneous hardware without recompilation.

The online documentation describes technical details of the ILNumerics Accelerator Compiler. Make sure to register for our newsletter here and don’t miss any news!

Where can I get it?

ILNumerics Accelerator has entered the public beta phase. It will be released with ILNumerics Ultimate VS version 7. The pre-release is available on nuget. Start here! General documentation on the new Accelerator is found here. It will be completed over the next weeks. Read the getting started guide(s), get your hands dirty and please, let us know what you think!

Patents

ILNumerics software is protected by international patents and patent applications. WO2018197695A1, US11144348B2, JP2020518881A, EP3443458A1, DE102017109239A1, CN110383247A, DE102011119404.9, EP22156804. ILNumerics is a registered trademark of ILNumerics GmbH, Berlin, Germany.

Performance on ILArray

Having a convenient data structure like ILArray<T> brings many advantages when handling numerical data in your algorithms. On the convenience side, there are flexible options for creating subarrays, altering existing data (i.e. lengthening or shortening individual dimensions on the fly), keeping dimensionality information together with the data, and last but not least: being able to formulate an algorithm by concentrating on the math rather than on loops and the like.
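For readers new to ILArray<T>, here is a short sketch of what this looks like. The creation function rand and the using static import follow the usual ILMath conventions; treat the exact calls as illustrative rather than authoritative:

using ILNumerics;
using static ILNumerics.ILMath;   // common setup: brings rand, mean, full, ... into scope

class ILArrayConvenienceSketch
{
    static void Demo()
    {
        // two 1000 x 200 arrays of random numbers (creation via rand is assumed here)
        ILArray<double> A = rand(1000, 200);
        ILArray<double> B = rand(1000, 200);

        ILArray<double> C = A + B;          // math instead of loops
        ILArray<double> col = C[full, 0];   // subarray: all rows of column 0
        ILArray<double> m = mean(C, 1);     // reduction along the second dimension
    }
}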

Convenience and Speed

Another advantage is performance: by writing C = A + B, with A and B being large arrays, the inner implementation is able to choose the most efficient way of evaluating this expression. Here is what ILNumerics does internally: Continue reading Performance on ILArray

Fast. Faster …. Performance Comparison: C# (ILNumerics), FORTRAN, MATLAB and numpy – Part II

In the first part of my somewhat lengthy comparison between Fortran, ILNumerics, Matlab and numpy, I gave some categorization insight into terms related to ‘performance’ and ‘language’. This part explains the setup, and hopefully the results will fit in here as well (otherwise we’ll need a third part :| )

Prerequisites

This comparison is going to be easy and fair! This means we will not attempt to compare an apple with the same apple wrapped in a paper bag (as is often done with the MKL), nor are we going to use specific features of an individual language/framework just to outperform another framework (like using data structures which are better handled in an OOP language, let's say complicated graph structures or so).

We rather seek an algorithm of:

  • Sufficient size and complexity. A simple binary function like BLAS: DAXPY is not sufficient here, since it would neglect the impact of the memory management – a very important factor in .NET.
  • Limited size and complexity in order to be able to implement the algorithm on all frameworks compared (in a reasonable time).

We chose the kmeans algorithm. It only uses common array syntax, no calls to linear algebra routines (which are usually implemented via Intel's MKL in all frameworks) and – used on reasonable data sizes – comes with sufficient complexity as well as computational and memory demands. That way we can measure the true performance of the framework, its array implementation and the feasibility of the mathematical syntax.

Two versions of kmeans were implemented for every framework: a ‘natural’ version which can be translated into all languages with minimal differences in execution cost, and a second version which applies the obvious optimizations recommended for each language where applicable. All ‘more clever’ optimizations are left to the corresponding compiler/interpreter.

The test setup: Acer TravelMate 8472TG, Intel Core™ i5-450M processor 2.4GHz, 3MB L3 cache, 4GB DDR3 RAM, Windows 7/64Bit. All tests were targeting the x86 platform.

ILNumerics Code

The printout of the ILNumerics variant of the kmeans algorithm is shown below. The code demonstrates only the obligatory features for memory management: function and loop scoping and specifically typed function parameters. For clarity, the function parameter checks and loop parameter initialization have been abbreviated.

public static ILRetArray<double> kMeansClust (ILInArray<double> X, 
                                             ILInArray<double> k,
                                             int maxIterations, 
                                             bool centerInitRandom,
                                             ILOutArray<double> outCenters) {
   using (ILScope.Enter(X, k)) {

// … (abbreviated: parameter checking, center initialization)

while (maxIterations --> 0) {
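    // assign every sample to its nearest centre (minimal L1 distance)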
    for (int i = 0; i < n; i++) {
        using (ILScope.Enter()) {
            ILArray<double> minDistIdx = empty(); 
            min(sum(abs(centers - X[full,i])), minDistIdx,1).Dispose();// **
            classes[i] = minDistIdx[0]; 
        }
    }
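    // recompute every centre as the mean of the samples assigned to it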
    for (int i = 0; i < iK; i++) {
        using (ILScope.Enter()) {
            ILArray<double> inClass = X[full,find(classes == i)]; 
            if (inClass.IsEmpty) {
                centers[full,i] = double.NaN;
            } else {
                centers[full,i] = mean(inClass,1);
            }
        }
    }
    if (allall(oldCenters == centers)) break; 
    oldCenters.a = centers.C; 
}
if (!object.Equals(outCenters, null))
    outCenters.a = centers; 
return classes; 
   }

}

The algorithm iteratively assigns data points to cluster centres and afterwards recalculates the centres according to their members. The first step needs n * m * k * 3 ops, hence its effort is O(nmk). The second step only costs O(kn + mn); hence the first loop clearly dominates the algorithm. A version taking better advantage of available ILNumerics features would replace the line marked with ** by the following line:

...
min(distL1(centers, X[full, i]), minDistIdx, 1).Dispose();
...

The distL1 function basically removes the need for multiple iterations over the same distance array by condensing the element subtraction, the calculation of the absolute values and their summation into one step for every centre point.
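Conceptually, distL1 computes the L1 distances in a single fused pass. A plain C# sketch of the idea (not the library implementation):

using System;

static class DistL1Sketch
{
    // |centers[:,j] - x| summed over all rows, for every centre j: one pass, no temporaries
    public static double[] DistL1(double[,] centers, double[] x)
    {
        int m = centers.GetLength(0);   // dimensionality
        int k = centers.GetLength(1);   // number of centres
        var dist = new double[k];
        for (int j = 0; j < k; j++)
        {
            double s = 0.0;
            for (int i = 0; i < m; i++)
                s += Math.Abs(centers[i, j] - x[i]);   // subtract, abs and sum fused
            dist[j] = s;
        }
        return dist;
    }
}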

Matlab® Code

For the Matlab implementation the following code was used. Note that the existing kmeans function in the Stats toolbox has not been utilized, because it deviates significantly from our simple algorithm variant in its configuration options and inner workings.

function [centers, classes] = kmeansclust (X, k, maxIterations, ...
  centerInitRandom)

% .. (parameter checking and initialization abbreviated)
 
while (maxIterations > 0)
        maxIterations = maxIterations - 1; 
        for i = 1:n
            dist = centers - repmat(X(:,i),1,k); % ***
            [~, minDistIdx] = min(sum(abs(dist)),[], 2);           
            classes(i) = minDistIdx(1); 
        end
        for i = 1:k
            inClass = X(:,classes == i); 
            if (isempty(inClass))
                centers(:,i) = nan;
            else
                centers(:,i) = mean(inClass,2);
                inClassDiff = inClass - repmat(centers(:,i),1,size(inClass,2)); 
            end
        end
        if (all(all(oldCenters == centers))) 
            break;
        end
        oldCenters = centers; 
 end

Again, a version better matching the performance recommendations for the language would avoid the repmat operation and reuse the single column of X for all centres in order to calculate the difference between the centres and the current data point.

...
dist = bsxfun(@minus,centers,X(:,i));
...

FORTRAN Code

In order to match our algorithm most closely, the first FORTRAN implementation mimics the optimized bsxfun variant of Matlab and the corresponding vector expansion in ILNumerics. The array of distances between the cluster centres and the current data point is pre-calculated in each iteration over i:

subroutine SKMEANS(X,M,N,IT,K,classes) 
!USE KERNEL32
  !DEC$ ATTRIBUTES DLLEXPORT::SKMEANS 

  ! DUMMIES
  INTEGER :: M,N,K,IT
  DOUBLE PRECISION, INTENT(IN) :: X(M,N)
  DOUBLE PRECISION, INTENT(OUT) :: classes(N)
  ! LOCALS 
  DOUBLE PRECISION,ALLOCATABLE :: centers(:,:) &
                   ,oldCenters(:,:) &
                   ,distances(:) & 
                   ,tmpCenter(:) & 
                   ,distArr(:,:)
  DOUBLE PRECISION nan
  INTEGER S, tmpArr(1)
  
  nan = 0
  nan = nan / nan
  
  ALLOCATE(centers(M,K),oldCenters(M,K),distances(K),tmpCenter(M),distArr(M,K))

  centers = X(:,1:K)  ! init centers: first K data points
  do  
    do i = 1, N       ! for every sample...
        do j = 1, K   ! ... find its nearest cluster
            distArr(:,j) = X(:,i) - centers(:,j)         ! **
        end do
        distances(1:K) = sum(abs(distArr(1:M,1:K)),1)    
        tmpArr = minloc ( distances(1:K) )
        classes(i) = tmpArr(1);
    end do
  
    do j = 1,K ! for every cluster 
        tmpCenter = 0; 
        S = 0; 
        do i = 1,N ! compute mean of all samples assigned to it
            if (classes(i) == j) then
                tmpCenter = tmpCenter + X(1:M,i); 
                S = S + 1; 
            end if     
        end do
        if (S > 0) then 
            centers(1:M,j) = tmpCenter / S; 
        else 
            centers(1:M,j) = nan;
        end if 
    end do
    
    if (IT .LE. 0) then ! exit condition
        exit; 
    end if 
    IT = IT - 1; 
    if (sum(sum(centers - oldCenters,2),1) == 0) then
        exit;  
    end if 
    oldCenters = centers; 
  end do
  DEALLOCATE(centers, oldCenters,distances,tmpCenter);
end subroutine SKMEANS

Another version of the first step was implemented which utilizes memory accesses more efficiently. Its formulation closely matches the ‘optimized’ version of the ILNumerics code:

    ...
    do i = 1, N          ! for every sample...
        do j = 1, K      ! ... find its nearest cluster
            distances(j) = sum(                                &
                                abs(                           &
                                    X(1:M,i) - centers(1:M,j)))
        end do

        tmpArr = minloc ( distances(1:K) )
        classes(i) = tmpArr(1);
    
    end do
    ...

numpy Code

The general variant of the kmeans algorithm in numpy is as follows:

 
from numpy import *

def kmeans(X,k):
    n = size(X,1)
    maxit = 20
    centers = X[:,0:k].copy()
    classes = zeros((1, n))
    oldCenters = centers.copy()
    for it in range(maxit):
        for i in range(n):
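            # assign sample i to its nearest centre (L1 distance)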
            dist = sum(abs(centers - X[:,i,newaxis]), axis=0)
            classes[0,i] = dist.argmin()
            
        for i in range(k):
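            # recompute centre i as the mean of its members (NaN if the cluster is empty)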
            inClass = X[:,nonzero(classes == i)[1]]
            if inClass.size == 0:
                centers[:,i] = nan
            else:
                centers[:,i] = inClass.mean(axis=1)

        if all(oldCenters == centers):
            break
        else:
            oldCenters = centers.copy()
    return classes, centers

Since this framework showed the slowest execution speed of all frameworks in the comparison (and due to my limited knowledge of numpy's optimization recommendations), no improved version was sought.

Parameters

The seven algorithm variants described above were all tested against the same data sets of corresponding size. Test data were uniformly distributed random numbers, generated on the fly and reused for all implementations. The problem sizes m and n and the number of clusters k were varied according to the following table:

parameter   min value   max value   fixed parameters
m           50          2000        n = 2000, k = 350
n           400         3000        m = 500, k = 350
k           10          1000        m = 500, n = 2000

While one parameter was varied, the other parameters were held fixed.
The results produced by all implementations were checked for identity. Each test was repeated 10 times (5 times for larger data sets) and the average execution time was taken as the test result. Minimum and maximum execution times were tracked as well.

Results

Ok, I think we will make it to the results in this part! :) The plots first! The runtime measurements are shown in the next three figures as error bar plots:

Execution speed comparison for varying k

Execution speed comparison for varying m

Execution speed comparison for varying n

Clearly, the numpy framework showed the worst performance – then again, we did not implement any optimizations for this platform. MATLAB, as expected, shows similarly long execution times. In the case of the unoptimized algorithms, the ILNumerics implementation is able to almost catch up with the execution speed of FORTRAN. Here, the .NET implementation needs less than twice the time of the first, naïve FORTRAN algorithm. The influence of the size of n is negligible, since most of the ‘work’ of the algorithm is done by calculating the distances of one data point to all cluster centres. Therefore, only the dimensionality of the data and the number of clusters matter here.

Conclusion

For the optimized versions of kmeans (the stippled lines in the figures) – especially for medium-sized and larger data sets (k > 200 clusters or m > 400 dimensions) – the ILNumerics implementation runs at the same speed as the FORTRAN one. This is due to ILNumerics implementing similar optimizations in its built-in functions as the FORTRAN compiler does. Also, the efforts of circumventing the GC via the ILNumerics memory management and of preventing bounds checks in inner loops pay off here. However, there is still potential for future speed-ups, since SSE extensions are not (yet) utilized.
For smaller data sets, the overhead of repeatedly creating ILNumerics arrays becomes more important, which will be another target for future enhancements. Clearly visible from the plots is the high importance of the choice of algorithm: by reformulating the inner working loop, a significant improvement has been achieved for all frameworks.