Speed Comparison: Low Level Expressions
This article compares the speed of executing the low-level expressions from the last getting started guide with numpy and FORTRAN.
Date: December 2024, ILNumerics version: 7.1 (initial Accelerator Release)
We keep things very simple here: the expression is executed repeatedly, with the same data and on the same hardware. Our goal is to get a good understanding of the relative execution speed of the same expressions on various platforms and technologies. We monitor the execution times as they evolve over the application's runtime. We reimplemented the expression on three platforms, trying to be as fair as possible.
What to compare with, and why?
For this comparison we chose only technologies which incur a similarly low implementation effort for array expressions. The selection of such tools is sparse, though.
We selected numpy because of its popularity. Further, it offers a rather simple acceleration technique in numba: to speed up a numpy expression it is sufficient to decorate the enclosing function with a special attribute. The decision which functions to accelerate still remains with the programmer, though. In contrast to other numpy accelerators (for example: jax), numba imposes very few compatibility constraints, hence it brings a programming efficiency comparable to ILNumerics.
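The decoration pattern described above can be sketched as follows. This is a minimal, hypothetical example; the actual benchmark expression is different, and the fallback decorator only serves to keep the sketch runnable when numba is not installed:

```python
import numpy as np

# numba's @njit decorator, when available, compiles the decorated
# function to machine code on its first invocation.
try:
    from numba import njit  # the optional accelerator
except ImportError:          # fall back to plain Python if numba is absent
    def njit(func):
        return func

@njit
def expression(a, b):
    # A hypothetical element-wise expression; the real benchmark
    # expression differs. numba compiles this into a tight loop.
    return a * 2 + b

a = np.arange(10, dtype=np.uint32)
b = np.ones(10, dtype=np.uint32)
print(expression(a, b))
```

Note that only the decorated function is compiled; the surrounding script remains ordinary interpreted Python.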
FORTRAN, on the other hand, is still widely regarded as the holy grail of performance. It offers only basic array operations and lacks some features of numpy / ILNumerics. Hence, implementing the test expression required some manual loops in the FORTRAN code. This does not affect the measurement results, though.
Finally, we chose to measure ILNumerics' speed in three execution modes: without acceleration, with default acceleration, and with the accelerator configured to also use experimental features. These features are currently in a preview phase and will soon be moved into the default configuration.
What has been measured?
On each platform the same data have been created: 4-dimensional arrays of (arbitrary) size [507 x 10 x 5 x 17] with elements of type 'uint32'. The average time consumed by each invocation is measured in cycles of 1000 iterations each, over the first 10 seconds of the application run.
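The measurement scheme can be sketched in numpy terms as follows. This is a hypothetical harness, not the benchmark code: the expression is a placeholder, and the wall-clock budget is shortened here from the article's 10 seconds:

```python
import time
import numpy as np

# The data layout described above: a 4-D uint32 array of
# (arbitrary) shape (507, 10, 5, 17).
rng = np.random.default_rng(42)
data = rng.integers(0, 1000, size=(507, 10, 5, 17), dtype=np.uint32)

def expression(a):
    # Placeholder expression; the benchmarked expression differs.
    return a * 2 + 1

# Timing in cycles of 1000 iterations each, over a wall-clock budget.
cycle_times = []
deadline = time.perf_counter() + 0.5   # the article uses 10 s
while time.perf_counter() < deadline:
    start = time.perf_counter()
    for _ in range(1000):
        expression(data)
    elapsed = time.perf_counter() - start
    cycle_times.append(elapsed / 1000)  # average seconds per invocation

print(f"{len(cycle_times)} cycles, last avg: {cycle_times[-1]:.2e} s")
```

Plotting each cycle's average against the wall clock yields line plots of the kind discussed in the results section below.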
The projects used to acquire the results are based on Visual Studio and PyCharm. Algorithms created with ILNumerics Computing (and the ILNumerics Accelerator) run on any platform supported by .NET. However, the measurements can only be reproduced on Windows, due to the plot creation involved. Find all projects for download here.
ILNumerics Code
The relevant array expression part from the ILNumerics measurement is identical for all three measurements:
FORTRAN Code
To realize broadcasting we had to expand the inner expressions for FORTRAN. By doing so we actually removed multiple temporary array results from the computation. Presumably, this helped FORTRAN's performance a lot. An alternative formulation would have used FORTRAN's SPREAD() operation. This would have caused multiple, redundant copies of the full-size array, and the performance is expected to be slower by a factor of ~2 in this case. Evaluating this assumption is left to the reader as an exercise.
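The trade-off can be illustrated in numpy terms, as an analogue of the FORTRAN situation (illustrative shapes only, not the benchmark code): broadcasting avoids materializing full-size input copies, while a SPREAD()-style formulation allocates them explicitly.

```python
import numpy as np

# Illustrative analogue of the SPREAD() trade-off; the shapes
# are arbitrary and unrelated to the benchmark.
a = np.arange(6, dtype=np.uint32).reshape(6, 1)   # column vector
b = np.arange(4, dtype=np.uint32).reshape(1, 4)   # row vector

# Broadcasting: no full-size temporaries for the inputs.
r_broadcast = a + b

# SPREAD-style: materialize full-size copies first, roughly what
# FORTRAN's SPREAD() would allocate as redundant temporaries.
a_full = np.broadcast_to(a, (6, 4)).copy()
b_full = np.broadcast_to(b, (6, 4)).copy()
r_spread = a_full + b_full

assert np.array_equal(r_broadcast, r_spread)  # same result, more copies
```

Both formulations compute the same result; only the number of full-size allocations differs, which is where the assumed factor of ~2 would come from.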
Numpy Code
The numpy code is straightforward. We are only interested in the execution performance. Thus, we deal neither with numpy's nifty type promotion rules nor with its integer modulo on overflow. Doing so would slow things down (further).
For the numba accelerator to compile without errors we had to move the inner expression (the actual measurement target) into its own function, which is then decorated with the @njit attribute. Further, to satisfy type constraints in numba, some variables had to be turned into scalar arrays.
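The scalar-array workaround can look like this. A hypothetical sketch: the variable names and the expression are made up, and the fallback decorator only keeps the sketch runnable without numba:

```python
import numpy as np

try:
    from numba import njit
except ImportError:          # run as plain Python if numba is absent
    def njit(func):
        return func

# Instead of passing a bare Python int (whose type numba may infer
# differently than intended), wrap the scalar in a 1-element typed
# array so its dtype is explicit.
scale = np.full(1, 3, dtype=np.uint32)   # scalar as a typed array

@njit
def inner(a, s):
    # Hypothetical inner expression; the benchmark target differs.
    return a * s[0]

a = np.arange(5, dtype=np.uint32)
print(inner(a, scale))
```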
This implementation is closest to the original expression and gives a fair result:
Results
What conclusions can be drawn from the chart?
The X axis corresponds to the wall clock: 0 marks the application start. At 10 seconds the measurement was stopped. Each point in the colored line plots displays the average execution time of invoking the array expression 1000 times on the respective platform at the corresponding point in time.
Both numba and the ILNumerics Accelerator perform JIT compilation, in addition to any 'built-in' JIT compilers/interpreters. Its effect is clearly visible in the line plots: both start at a moderate execution speed after application start. After a while the code is adapted and improved, leading to shorter (faster) execution times. For numba, this happens after the first iteration. For the ILNumerics Accelerator, adaptation took almost 5 seconds in the initial accelerated run (red line). However, the JIT took less than 100 ms for the second accelerated run (green line).
The reason is that a large number of technologies must be loaded, jitted, loaded again, and executed before the adaptation of low-level code to the hardware found can even begin. Note, however, that this delay is incurred only once. Subsequent JIT compilations are much faster, as can be seen from the green plot at the bottom (which was measured in the same app as the red line, but after the first acceleration had completed). Such JIT delays will be strongly reduced in future releases.
Note further that there was a significant delay before the numpy code started (~2 s). We assume this is due to similar reasons: the runtime environment and the interpreter must be loaded and started.
After the JIT phase all execution runs performed smoothly and stably until the measurement ended after around 10 seconds.
Discussion
Not surprisingly, numpy (the top line in the chart) is still the slowest. Its accelerator numba brings a speed-up factor of ~2. The result is still roughly twice as slow as the code of ILNumerics.Computing without acceleration. This speed roughly corresponds to ILNumerics.Computing version 6, and to ILNumerics version 7 running in Debug mode.
When running in Release mode (hence, with the Accelerator enabled) the speed of ILNumerics (red line in the chart) is comparable to FORTRAN (the yellow line) at first. After the initial JIT adaptation the speed automatically increases by a factor of ~2. The speed of ILNumerics now reliably exceeds the speed of FORTRAN.
And this is not yet the limit! By configuring the experimental SpecializeFlags feature for the expression, the speed-up improves further by a whole order of magnitude.
Note that all this speed-up is gained by utilizing the CPU only! We have not even involved the GPU in the system yet. While GPU computing is already fully supported, automatic transitioning to the GPU is not yet enabled in version 7.1. It will be performed automatically in a future release.
Read more
- Compare the efficiency of OpenMP inside Intel's MKL with ILNumerics auto-parallelization