ILNumerics Accelerator Compiler
The Accelerator Technology
ILNumerics Accelerator technology is based on novel methods for the autonomous parallelization of sequential array codes.
In this process, key decisions that would otherwise have to be made by a programmer are taken over by the executing computer, at the point when all relevant data is available: at runtime. The programmer is relieved of any optimization task.
Potential for parallelization (i.e. “which parts of the code can actually be executed in parallel?”) exists at different levels of granularity in a program. ILNumerics identifies parallel regions on multiple levels, then selects and uses the best available hardware for the most efficient execution. Heterogeneous compute resources are used as appropriate.
The result: simple, high-level array codes run on any hardware with optimal resource utilization.
Execution Nets
ILNumerics breaks up the usual sequential execution of a program. Instead of processing the array instructions one after another, each instruction is first integrated at runtime into a highly dynamic, volatile data structure: the execution net. The execution net contains an (often large) number of upcoming instructions and links them according to their actual dependencies.
The nodes of the execution net work autonomously. Each node decides independently which of the available hardware (or which part of it) to use for its execution, how, and when. To do this, a node can access all relevant information: the data, the hardware and the state of both are known at this time.
While the main thread is constantly adding upcoming instructions to the execution net, the existing nodes are processed, using array pipelining (see below). All hardware resources available at runtime are therefore used concurrently. The parallel potential of the algorithm and the data is fully exploited.
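The idea of an execution net can be illustrated with a small, language-neutral sketch. Python is used here only for illustration; the names `Node` and `run_net` are hypothetical and are not part of the ILNumerics API, and a plain thread pool stands in for the hardware selection described above. Each node holds one array instruction plus links to the nodes whose results it needs, and it starts working as soon as those inputs are complete.

```python
from concurrent.futures import ThreadPoolExecutor


class Node:
    """One array instruction plus links to the nodes whose results it consumes."""

    def __init__(self, name, func, deps=()):
        self.name = name
        self.func = func
        self.deps = list(deps)
        self.future = None


def run_net(nodes, workers=4):
    """Submit nodes in program order; each node waits only for its own inputs.

    Assumes every node appears after the nodes it depends on, as it would
    when a main thread appends instructions in their sequential order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for node in nodes:
            def task(node=node):
                # Block until the inputs of this node (and only those) are ready.
                args = [dep.future.result() for dep in node.deps]
                return node.func(*args)
            node.future = pool.submit(task)
        return {node.name: node.future.result() for node in nodes}


# Sequential program: a = data; b = a * 2; c = a + 1; d = b + c
a = Node("a", lambda: [1, 2, 3, 4])
b = Node("b", lambda x: [v * 2 for v in x], [a])
c = Node("c", lambda x: [v + 1 for v in x], [a])
d = Node("d", lambda x, y: [p + q for p, q in zip(x, y)], [b, c])

results = run_net([a, b, c, d])
print(results["d"])  # [4, 7, 10, 13]
```

In this sketch `b` and `c` depend only on `a`, so they may run at the same time, while `d` waits for both. The sequential program text stays unchanged; only the dependency links determine what can overlap.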
Array pipelining to parallelize sequential programs
For 60 years, parallel execution units have played an important role in processors. With the help of these “pipelines”, parts of sequential operations can be executed in parallel and thus run faster. ILNumerics applies this basic idea to high-level array algorithms. Think of an 'array pipe' as a core of a CPU or of any OpenCL-capable device.
Independent operations each choose their own pipe for execution and thus execute simultaneously. Parts of dependent operations, on the other hand, begin execution as soon as their inputs are partially complete. This means that even programs that cannot be parallelized manually will benefit! See this example.
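The following minimal sketch shows the pipelining principle for two dependent elementwise operations. It is an assumption-laden illustration in Python, not ILNumerics code: the functions `stage` and `pipeline` are hypothetical, threads and queues stand in for array pipes, and the data is split into fixed-size chunks. The second operation starts on a chunk as soon as the first operation has produced it, rather than waiting for the whole array.

```python
import queue
import threading


def stage(func, inbox, outbox):
    """Apply func to each chunk as soon as it arrives; None marks the end."""
    while True:
        chunk = inbox.get()
        if chunk is None:
            outbox.put(None)  # pass the end marker on to the next stage
            return
        outbox.put([func(v) for v in chunk])


def pipeline(data, funcs, chunk_size=2):
    """Run dependent elementwise operations as overlapping pipeline stages."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()

    # Feed the first stage chunk by chunk, then signal the end of the data.
    for start in range(0, len(data), chunk_size):
        queues[0].put(data[start:start + chunk_size])
    queues[0].put(None)

    # Collect finished chunks from the last stage in their original order.
    result = []
    while True:
        chunk = queues[-1].get()
        if chunk is None:
            break
        result.extend(chunk)
    for t in threads:
        t.join()
    return result


# Sequential program: b = a * a; c = b + 1  (c depends on b)
out = pipeline([1, 2, 3, 4, 5], [lambda v: v * v, lambda v: v + 1])
print(out)  # [2, 5, 10, 17, 26]
```

Because each stage is a single thread reading from a FIFO queue, chunk order is preserved and the result matches the sequential program.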
Low-level parallelism
When a node of the execution net has committed its workload to a pipeline, it adapts its execution parameters to the low-level specifics of the pipe. Here, ILNumerics uses so-called “micro-JIT” compilers to compile each node's operations into highly optimized kernels for the selected hardware.
Micro-JITs are specialized for certain predefined numerical operations and are thus able to apply all essential optimizations: removal of intermediate results, loop unrolling, SIMD vectorization, constant folding, and newer methods, such as those for reducing latency. See this example.
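To make one of these optimizations concrete, here is a minimal sketch of the removal of intermediate results for the expression `(a + b) * c`. It is written in Python for illustration only and does not show how the micro-JIT compilers actually generate code; `unfused` and `fused` are hypothetical names. The first version materializes a temporary array for `a + b`; the second computes each output element in a single pass without the temporary.

```python
def unfused(a, b, c):
    """Evaluate (a + b) * c with a full-size temporary for a + b."""
    tmp = [x + y for x, y in zip(a, b)]      # intermediate array
    return [t * z for t, z in zip(tmp, c)]   # second pass over the data


def fused(a, b, c):
    """Evaluate (a + b) * c in one pass, without the intermediate array."""
    return [(x + y) * z for x, y, z in zip(a, b, c)]


a, b, c = [1, 2, 3], [4, 5, 6], [2, 2, 2]
print(unfused(a, b, c))  # [10, 14, 18]
print(fused(a, b, c))    # [10, 14, 18]
```

Both versions return the same values. The fused form touches each input element once and allocates no temporary, which is the kind of saving such an optimization aims for on real hardware.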
For some operations and for full compatibility with existing libraries, however, this stage is skipped and predefined (third party) code is used instead. See this example.
Summary
ILNumerics identifies parallel potential during execution and at a fine-grained level. It requires no manual control. The results dynamically adapt to any (heterogeneous) hardware and use it for execution more efficiently than manual methods. While independent parts of the program are sped up, concurrent data access retains its sequential semantics (correctness guarantee). The methods scale equally well with the parallel capacity of the hardware for both small and large data.
Current State, Availability
The ILNumerics Accelerator is available as part of ILNumerics.Computing on nuget.org.
The initial release of ILNumerics Accelerator focusses on fundamental unary, binary, reduce and generator array instructions, as well as more complex instructions: FFT and linear algebra on the CPU. The ultimate goal is for all parts of an algorithm to consist of supported ILNumerics array instructions. Such regions will utilize all parallel compute resources with optimal efficiency.
The documentation for ILNumerics Accelerator is structured into the following sections:
Getting Started Guide II: Speeding-up the k-means algorithm
Getting Started Guide III: Array Pipelining for Faster Fast Fourier Transforms (FFFT)
Getting Started Guide IV: Array Pipelining vs. Multithreading
Speed Comparison with numpy (numba) and FORTRAN
Table of supported Features & Roadmap
Don't miss our introductory blog article series: part 1, part 2.