Tag Archives: compiler

ILNumerics: High Performance as Default

ILNumerics releases Version 7.0

The ILNumerics team is proud and excited to announce the availability of a new major release of its ILNumerics Ultimate VS product line. This milestone represents a significant advancement in compiler technology: it introduces the first autonomously parallelizing array compiler, ILNumerics Accelerator.

Array languages and libraries, such as numpy and Matlab, are popular within the scientific community. They simplify the description of complex numerical algorithms. However, when these numerical codes are implemented in industrial-grade products, distributed to customers, and maintained by a larger development team for many years, existing tools reveal considerable limitations, execution speed arguably being the greatest among them.

ILNumerics Accelerator is here to resolve this issue once and for all. Built on the convenient language of the ILNumerics Computing Engine for authoring scientific array codes directly in C#/.NET, our intelligent Accelerator compiler reformulates array codes and executes them in the most efficient manner, unguided and autonomously.
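To give a concrete impression of that language, the following minimal sketch shows array code written against the Computing Engine as we know it from the current API (package and function names are cited from memory and meant as an illustration, not as authoritative documentation). It computes a column-wise mean of a random matrix in a single array expression:

    using ILNumerics;                 // assumed package: ILNumerics.Computing
    using static ILNumerics.ILMath;   // rand(), sum(), sqrt(), ...

    class ArrayCodeSketch
    {
        static void Main()
        {
            Array<double> A = rand(1000, 200);            // 1000 x 200 random matrix
            Array<double> colMeans = sum(A, 0) / 1000.0;  // column-wise mean, 1 x 200
            System.Console.WriteLine(colMeans.S);         // prints the size descriptor
        }
    }

Readers familiar with numpy or Matlab will recognize the style immediately; the point of the Accelerator is that such lines no longer need to be hand-tuned for speed.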

Short Dive: Building an Array Compiler …

The key to optimal hardware utilization today is, of course, the ability to efficiently parallelize the workload. Therefore, the ILNumerics Accelerator compiler automatically detects and utilizes parallel potential within subsequent array instructions of your algorithm.
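As a hedged illustration of what "parallel potential within subsequent array instructions" means (the example and its names are ours, not taken from the Accelerator documentation): in the snippet below, the first two statements do not depend on each other and may conceptually run concurrently, while the third has to wait for both results.

    using ILNumerics;
    using static ILNumerics.ILMath;

    static class ParallelPotentialSketch
    {
        static void Main()
        {
            Array<double> A = rand(2000, 2000);
            Array<double> B = rand(2000, 2000);

            // independent of each other -> parallel potential
            Array<double> a = sum(A * A, 1);   // reads A only
            Array<double> b = sum(B * B, 1);   // reads B only

            // depends on both a and b -> must wait for their results
            Array<double> c = sqrt(a + b);
            System.Console.WriteLine(c.S);
        }
    }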

The prevalent approach to automatic parallelization today involves analyzing your code to detect instructions that can execute in parallel. The code is then rewritten to allow these independent instructions to run concurrently. This approach originated at a time when C/C++ and FORTRAN were considered ‘high-level’ languages. While these languages offer some basic array support, numerical algorithms written in them inherently deal with scalar data.

In contrast, the algorithms we handle use n-dimensional arrays as the fundamental data type. When striving for real efficiency, the arrays’ complexity prohibits any compile-time decision about data independence, a suitable execution device, or a low-level implementation using vector instructions, among other factors. Too many variables determine the optimal execution strategy, and their variation is too extensive to identify at development time.
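To make that concrete with a deliberately simple, hypothetical example (sizes and conclusions are illustrative, not measurements): the very same element-wise expression, written once, calls for entirely different implementations depending on facts that only exist at runtime.

    using ILNumerics;
    using static ILNumerics.ILMath;

    static class RuntimeFactorsSketch
    {
        static void Main()
        {
            Array<double> A1 = rand(3, 3),       B1 = rand(3, 3);
            Array<double> A2 = rand(4000, 4000), B2 = rand(4000, 4000);

            // same source expression, very different optimal strategies:
            Array<double> C1 = A1 * B1 + 0.5;   // tiny data: a plain loop on one thread wins
            Array<double> C2 = A2 * B2 + 0.5;   // large data: SIMD on many cores, or a GPU

            System.Console.WriteLine(C1.S + " / " + C2.S);
        }
    }

Add to this the current load of each device, where the operands are stored, their element type and storage order, and it becomes clear why no fixed, compile-time choice can be optimal for all cases.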

… with a fresh Approach to Parallelization …

ILNumerics takes a data-driven approach to parallelization, making all crucial decisions ‘just in time’ at runtime. This method enables us to implement the low-level internals of array instructions efficiently and eliminates the need for expensive dependency analysis of large program segments. It’s easier to determine at runtime whether an instruction relies on other data and when that data will be available.

However, this increases the compiler’s complexity significantly. Instead of producing fixed code for an array instruction found in the original program, we now generate an intermediate program, often representing several subsequent array instructions. This intermediate program is then transformed at runtime to reflect the original instructions in a manner that respects all influencing factors and enables optimal efficiency.

… with Intelligence and Freedom

These programs all run asynchronously, scheduling themselves onto suitable devices for execution at runtime. They connect with previous and subsequent asynchronous neighbors on the fly, even as they progress in calculating results. They autonomously determine the optimal implementation on the device they select. Significant decision-making power comes with immense responsibility. The final program code builds up asynchronously and autonomously at runtime.

Executing the same code branch twice may result in very different machine instructions. As the creators of the compiler, we can’t predict many details of the final implementation. We can’t even forecast the actual number of asynchronous programs, which we refer to as segments, executing concurrently. The decision now rests with the computer. It has a better understanding of the situation than we do and can process information and make educated decisions much faster and more effectively.

Why another Beta, then?

The first autonomous array compiler has arrived. Are you eager to try it out? Great! Go ahead and test the current beta from nuget.org. Its performance is genuinely impressive.

The majority of our users are building industrial products and have chosen ILNumerics for its reliability and seamless integration into .NET (which means all of .NET, since 2005). The compiler will become a regular part of the ILNumerics Computing Engine once it’s ready. Currently, it’s like a young child, learning new things each day. It’s learning to walk and talk, occasionally stumbling over previously unknown obstacles. Sometimes, it even surprises its creators with unexpected efficiency.

With recent advancements, the Accelerator compiler is now set to ‘always on’ mode. This marks a significant milestone in the development of a compiler. It’s now trusted to be stable and flexible enough to handle all codes, even unseen ones, ensuring it produces correct results and serves its primary purpose: to enhance efficiency.

The compiler will continue to develop until the final release. It will become more stable and gain more real-world exposure. We would greatly appreciate it if your feedback contributed to its maturity by then.

All previous modules of ILNumerics have been adapted for the new Accelerator. They are thoroughly tested, have undergone many bugfixes and improvements, and are available for production on nuget.org as of today.


 

Introducing the ILNumerics Accelerator:

Watch the introduction video: https://ilnumerics.net/media/andere/ILNumerics_Segments_2022.mp4

ILNumerics Accelerator – A better Approach to faster Array Codes, Part I

A Truth most Programmers won’t tell.

TLDR: Twenty years ago, computer manufacturers, while trying to boost CPU performance, were confronted with the hard wall imposed by physical limitations. It took a long time for everyone involved to realize that individual processors would no longer become significantly faster. The previous approach had simply made it too easy to cope with the constantly growing demands caused by ever larger data and ever more complex programs. Haymo Kutschbach, founder and CEO of ILNumerics, explains the challenges involved in automatic parallelization on today’s hardware and why traditional compilers will not satisfy the strong demand for a feasible solution. A better, working approach is presented.

Reading time: 10 minutes

For many decades, the rule applied: “Your program is too slow? Simply buy the latest hardware!” This was made possible by the availability of (general purpose) programming languages, as well as a stream of innovations in processor architectures and the associated compilers. They allowed the programs to be written largely independently of the execution hardware. The compiler adapted abstract codes to the low-level features of the processors. Everything fit together so conveniently!

And then nature threw a spanner in the works of this easy calculation. New computers have continued to obey Moore’s Law and become more powerful with each generation. But this new computing power is now distributed over many processors! A program can only use the new speed if it can use all processors at the same time. In addition, a significant part of the computing power of today’s computers resides in heterogeneous architectures such as GPUs and other accelerator hardware.

The consequences can be observed today in probably every development department: Programs are written abstractly. Efforts are made to create ‘clean code’ that can be tested and maintained. Unfortunately, it often executes too slowly! So you start to examine the code for optimization potential. You identify individual bottlenecks, decide manually on a parallelization strategy, implement, test and measure again. Many programmers don’t see this as a problem – their expert knowledge guarantees them a good income for many years to come! Since there are no alternatives, the enormous delays in time to market are accepted. Just like the fact that the next generation of devices will again require a complete rewrite of large parts of the software.

Hardware today is far more diverse and heterogeneous than it was 20 years ago. But why is that actually a problem? Why are we still not able to automatically adapt our programs to such hardware and run them there, as we once could?

If it’s too slow, blame the Compiler!

For one, it’s because the compiler market has traditionally been dominated by processor manufacturers. Their compilers served to make their own hardware more accessible. Even though we’ve seen a recent trend towards vendor-independent compilers, they still continue the traditional approach: compiling (“lowering”) code of a general purpose programming language (like C or Fortran) to hardware instructions that can be executed on a specific (and previously selected) processor architecture. What we actually need, however, is a compiler that translates an abstract program for a “computer” – with all its diverse computing resources.

This task, however, is much more demanding! Why? Here we come to the second reason that has so far prevented a working solution: granularity. In order to be able to use parallel resources, the software must contain additional instructions that take care of the distribution of the individual program parts. This additional overhead makes a certain minimum program size (workload) mandatory. Parallelization only makes sense if the gain in speed through parallel execution exceeds the additional management effort!
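A rough back-of-the-envelope model (our own simplification, not a formula from this article) makes the trade-off explicit: if a chunk needs T_s seconds sequentially, runs on p workers and costs T_o seconds of scheduling and synchronization overhead, its parallel runtime is roughly T_par = T_s / p + T_o, and splitting only pays off if T_s · (1 − 1/p) > T_o. With, say, 50 µs of overhead and 8 workers, a chunk needs more than roughly 57 µs of sequential work before parallel execution wins – far more than a single small vector operation provides.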

Here, the term “granularity” refers to the number of individual low-level instructions that a piece of code requires to calculate its result. Obviously, the granularity increases with the size or complexity of a program part, because more low-level instructions are executed in the end. So, in order to make efficient use of parallelism, it is not enough to consider individual instructions of a program. Rather, a compiler must merge larger program parts (‘grains’) and distribute them to the available computing resources. One of the most important challenges of a parallelizing compiler is to find the optimal size of these “chunks” – and thus the optimal granularity.

This brings us to the third reason: complexity. In general, programs can become arbitrarily complex. Compilers have the task of translating programs into another, useful form. But how does one translate information that can be arbitrarily complex? Traditional compilers achieve this goal by limiting their consideration to individual instructions in the program. They know the (manageable) set of possible translations for each of these instructions. The finished program is then little more than the chain of such individual translations. Nevertheless, traditional compilers are already considered to be extremely complex software!

Unfortunately, this “complexity reduction” trick is in direct contradiction to the minimum granularity requirement described above. A parallelizing compiler must face the challenge of finding an optimal translation even for program chunks that consist of multiple or many instructions. As always, low-hanging fruit exists (see, e.g., TensorFlow). However, an optimal solution must not restrict itself to individual, specific instructions just because their translation is particularly simple and therefore easy to implement!

And then there is a fourth aspect that also plays into the area of complexity. In addition to the optimal chunk size (granularity), it is also important to identify those grains of a program that can be executed independently of one another. The program analysis required must also take into account the best possible granularity. A compiler must therefore be able to understand large program sections, form chunks of an optimal grain size and execute them efficiently in parallel on heterogeneous hardware.

Ultimately, it is the complexity of this task that has so far prevented such a compiler from being born. In general, development departments still follow the manual approach described earlier. It requires expert knowledge and enormous effort to manually perform the necessary steps (granularity and independence analysis, parallel implementation) for one specific program. But despite the enormous amount of time and the enormous maintenance issues involved, this approach only scratches the surface! It can neither address nor solve the real problem.

Are we lost?

Not yet! Let’s wrap it up:

  • Traditional compilers work at too low a level of granularity. They consider too few instructions to cover a workload suitable for efficient parallelization.
  • Previous attempts to include large/all program parts in an analysis have only been successful in a few special cases. Global analysis for general programs is an unsolved problem and, in the opinion of the author, will remain so for a long time to come.

But now we come to the topic of our headline! For a by no means small class of programs, ILNumerics has succeeded in developing a compiler that fulfills all optimality requirements: the class of numeric, array-based programs. They are written by scientists, mathematicians, and engineers to encode the innovations of our modern times. They form the core of big data, machine learning and artificial intelligence, and of all codes making use of them. Numerical algorithms realize innovations – moreover: they are these innovations! They allow great complexity to be formulated without great effort. They handle huge data and, therefore, require the greatest speed.

A development department faced with the challenge of accelerating its numerical array codes can only select and use the best possible language and libraries available. Technical requirements often rule out prototyping environments like numpy and Matlab. Individual parts may be outsourced to GPUs. Other parts are parallelized using multithreading. All of this slows down the industrial development process significantly. It is like writing GUI programs in assembler…

A Better Approach to faster Array Codes

ILNumerics proposes a new, faster approach to the execution of array codes. ILNumerics Accelerator automatically finds parallel potential in array-based algorithms (as in: numpy, Matlab, ILNumerics) and efficiently distributes their workload to heterogeneous computing resources – dynamically and at runtime, when all important information is available. Once again, it scales your algorithm’s speed to any heterogeneous hardware without recompilation.

The online documentation describes technical details of the ILNumerics Accelerator Compiler. Make sure to register for our newsletter here and don’t miss any news!

Where can I get it?

ILNumerics Accelerator has entered the public beta phase. It will be released with ILNumerics Ultimate VS version 7. The pre-release is available on nuget. Start here! General documentation on the new Accelerator is found here. It will be completed over the next few weeks. Read the getting started guide(s), get your hands dirty and please, let us know what you think!

Patents

ILNumerics software is protected by international patents and patent applications. WO2018197695A1, US11144348B2, JP2020518881A, EP3443458A1, DE102017109239A1, CN110383247A, DE102011119404.9, EP22156804. ILNumerics is a registered trademark of ILNumerics GmbH, Berlin, Germany.

Julia, Math.NET, M#, FORTRAN .NET, managed LAPACK, MKL and outlook

With the recent advances in the ILNumerics core module we were able to improve the computational part of our libraries considerably. The execution speed has increased by orders of magnitude – and while catching up with C++ and FORTRAN, the .NET platform becomes more attractive to an even wider community of scientists, engineers and programmers of numerical applications.

We find ourselves part of a very exciting evolution. A whole bunch of young and not so young projects are targeting goals similar to those of ILNumerics: convenience and performance. One interesting project among them is the Julia language. A language very similar to MATLAB syntax (hence to ILNumerics’ syntax as well) is combined with a JIT compiler from the LLVM suite (what else?). While the convenience of the language is beyond question, the speed provided by the LLVM JIT is “in the range of 2x C++”. The language is dynamic, which marks an important difference from ILNumerics.

Interestingly enough, one of the developers of Julia was involved in the creation of M# (according to this blog post):

Jeff [Bezanson] was a principal developer of M#, an implementation of the MATLAB language running on .NET

And this is where it starts getting even more interesting. Consider having a compiler for ‘ILM#’ (an imaginary extension of Julia/MATLAB with type safety) that outputs .NET IL code and at the same time incorporates the deterministic disposal patterns of ILNumerics! However, I have not been able to find any working MATLAB-to-.NET compiler yet, nor any M# project. Anyone out there knowing where it lives today?
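For readers who wonder what “deterministic disposal” refers to: ILNumerics returns the large storage behind its arrays to a memory pool deterministically, via an artificial scoping pattern, rather than waiting for the garbage collector. A simplified sketch from memory, using the class names of the ILNumerics API of that era (see the ILNumerics function rules documentation for the authoritative form, as details may differ):

    using ILNumerics;

    class DisposalSketch
    {
        static void Main()
        {
            // arrays created inside this scope have their storage returned
            // to the pool deterministically when the scope is disposed
            using (ILScope.Enter())
            {
                ILArray<double> A = ILMath.rand(1000, 1000);
                ILArray<double> B = ILMath.sum(A * A, 1);
                System.Console.WriteLine(B.S);
            }
        }
    }

A hypothetical ‘ILM#’ compiler could emit exactly this pattern automatically, which is what would make the combination so attractive.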

The idea of being able to convert complete MATLAB code branches into ILNumerics libraries, making them run at the speed of C/FORTRAN, is very appealing indeed. And there is another potential language as a conversion source: FORTRAN itself! While a lot of developers value the platform independence and convenience of C# over FORTRAN (especially when it comes to GUI development or even RAD) – they arguably will not love the idea of rewriting all their grown-over-the-years FORTRAN algorithms again in ILNumerics. Having the option to automatically convert that code into C#/ILNumerics would not only save them from PInvoking into native FORTRAN libraries, but even make that code run on all platforms supported by .NET.

Having this in mind, I recently did some searching for matching projects. The two attempts I found:

  • Lahey Fujitsu, LF .NET Fortran compiler. Seems to be discontinued?
  • Silverfrost FTN95: Fortran 95 for Windows

I did some tests with FTN95. With some help from Paul Laider of Salford I have been able to create a ‘fully managed’ LAPACK version right from the netlib sources, with only very minor modifications to the official FORTRAN code. I say ‘fully managed’ because at the end you get a real .NET assembly. However, the compiler comes with some drawbacks IMO, which I will write about in a later post.

However, this brings us further towards one of our goals (and to the last CAPITALIZED buzzword from our headline): not having to rely on MKL anymore. Since we have been able to speed up the matrix multiplication to around half the speed of the MKL, having all the LAPACK stuff within C# marks the next milestone. When all is finished, the user will have the option to choose from these deployment schemes:

  • ILNumerics fully managed version. Suitable for Silverlight, Office Addons, Visual Studio Plugins etc., <8 MB, all platforms supported, no native libs
  • ILNumerics 32 or 64 bit, with native support, platform specific, around 2 times faster, considerably larger binaries

And this is still without potential improvements on the “half the speed of MKL” issue …

As always: any comments welcome.