Autonomous parallel execution for .NET array code
Modern hardware is massively parallel, but software production is still largely sequential. Compilers can often use SIMD vector units automatically. Beyond that fine granularity, parallel execution across cores, accelerator devices, and larger program regions usually requires manual engineering, because global static dependency analysis becomes too complex to solve safely and profitably in general.
ILNumerics closes this gap by replacing global static dependency analysis with local runtime dependency checks. Instead of searching a sequential program for safe places to parallelize, ILNumerics treats every array instruction as an autonomous execution unit that may run freely and concurrently, coordinating only with the instructions that directly produce its required inputs.
The core idea
In ordinary execution, the main thread walks through the program and executes array instructions in source-code order. This creates unnecessary waiting: the thread may spend time in long-running operations while other, independent work could already be prepared or executed elsewhere.
ILNumerics changes this model. The main thread no longer has to perform the full numerical workload itself. Its sole task becomes discovering upcoming array instructions, capturing their minimal dependency structure, and releasing them for autonomous execution.
The more work leaves the main thread, the more execution time can overlap. Sequential program time is compressed into concurrent and pipelined runtime behavior: independent instructions may run concurrently, while dependent instructions preserve only the local order required for correct results.
In practice, ILNumerics Accelerator can achieve speedups of 3x to more than 100x for high-level .NET numerical code, depending on workload and hardware configuration (benchmarks).
Autonomous array instructions
ILNumerics array instructions are not passive operations waiting for a central scheduler. Each instruction is an active execution unit with its own decision logic.
An instruction attempts to execute its workload as early as possible. At runtime, it decides when to run, where to run, and how to run, based on the operations it handles, its direct dependencies, data location, and available execution resources.
This decentralized model reduces the need for global synchronization. The main thread does not have to orchestrate every execution decision. It exposes the necessary dependency information, while autonomous instructions coordinate locally.
The execution net
The execution net is the moving window of array instructions that are currently active. It is established and constantly updated by the main thread as the program is visited in normal sequential order.
For each visited instruction, the execution net captures only the dependency structure required for correctness: which local inputs must be available before which work can proceed. It does not preserve the complete rigid source-code order when that order is not required by data dependencies.
At any time, the execution net may contain hundreds or thousands of active instructions. Within this window, instructions can execute, overlap, and move to suitable compute resources as soon as their local dependencies permit. Completed instructions leave the execution net, cleaning up after themselves.
The main thread builds the minimum dependency structure; autonomous instructions turn it into parallel execution. Workload and scheduling are parallelized.
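The division of labor described above can be illustrated with a small conceptual sketch. It is written in Python for brevity and does not use or imitate the ILNumerics API: the `release` helper, the thread pool, and the list-based "arrays" are all assumptions made for illustration. The main thread visits instructions in program order and only records which direct inputs each one needs; each released instruction then waits on exactly those inputs and nothing else.

```python
from concurrent.futures import ThreadPoolExecutor


def release(pool, op, *inputs):
    """Release one instruction (hypothetical helper, not an ILNumerics call).

    The instruction waits only on the futures of its direct producers.
    """
    def run():
        args = [f.result() for f in inputs]  # local dependency check
        return op(*args)
    return pool.submit(run)


def run_program():
    with ThreadPoolExecutor(max_workers=4) as pool:
        # The main thread walks the program in source order and releases work.
        a = release(pool, lambda: [1.0, 2.0, 3.0])
        b = release(pool, lambda: [10.0, 20.0, 30.0])
        c = release(pool, lambda x: [v * 2 for v in x], a)   # needs a only
        d = release(pool, lambda y: [v + 1 for v in y], b)   # needs b only
        e = release(pool, lambda x, y: [p + q for p, q in zip(x, y)], c, d)
        return e.result()  # the only point where the main thread waits


result = run_program()
```

In this sketch `c` and `d` have no dependency on each other, so they may run concurrently even though they appear one after the other in the source. Because instructions are released in program order, every input an instruction waits on was released before it.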
Array pipelining
Dependent instructions do not always need to wait for complete upstream results. Often, partial information is enough to begin useful work.
For example, the output size of a segment can often be determined from the shapes of its input arrays. Device selection may only require information about input shape and data location in order to estimate transfer and compute costs. These decisions can start before all input values have been fully produced.
By starting downstream preparation and execution as soon as partial input information is available, ILNumerics hides latency and overlaps dependent operations. This is similar in spirit to instruction-level parallelism in a processor pipeline, but adapted to multidimensional array data and whole-program regions.
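As a minimal illustration of deciding on partial information, the following Python sketch derives the output shape of an elementwise binary operation from the input shapes alone, before any input values exist. It assumes NumPy-style broadcasting rules, and the function name `broadcast_shape` is hypothetical; it is not taken from ILNumerics.

```python
def broadcast_shape(a, b):
    """Output shape of an elementwise op, computed from input shapes only.

    Assumes NumPy-style broadcasting: shapes are aligned at the trailing
    dimension, and a dimension of length 1 stretches to match the other.
    """
    n = max(len(a), len(b))
    a = (1,) * (n - len(a)) + tuple(a)
    b = (1,) * (n - len(b)) + tuple(b)
    out = []
    for x, y in zip(a, b):
        if x != y and x != 1 and y != 1:
            raise ValueError(f"shapes {a} and {b} are not compatible")
        out.append(y if x == 1 else x)
    return tuple(out)


# A downstream instruction could size its output buffer at this point,
# while the upstream values are still being computed.
out_shape = broadcast_shape((3, 1), (1, 4))
```

Here `out_shape` is `(3, 4)`. The same kind of shape-only information would be enough to estimate memory requirements and transfer costs ahead of time.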
Micro-JIT optimization
Autonomous execution decides when and where work should run. Micro-JIT optimization remains necessary to make that work efficient at the selected execution granularity.
Once an instruction commits work to an execution resource, ILNumerics adapts the operation to the low-level characteristics of that target. This includes temporary elimination, cache-aware execution, SIMD and vector-register-aware kernels, loop unrolling, and optimized numerical primitives.
This combines two layers of performance: program-level asynchronous execution and efficient sub-device execution.
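Temporary elimination, one of the optimizations listed above, can be shown with a small Python sketch. The two functions below are illustrative assumptions, not ILNumerics kernels: the first evaluates `a * b + c` the way a naive array library would, materializing a full-size intermediate array, while the second produces the same result in a single pass without it.

```python
def with_temporaries(a, b, c):
    """Evaluate a * b + c with a full-size intermediate array."""
    tmp = [x * y for x, y in zip(a, b)]      # temporary: one extra array
    return [t + z for t, z in zip(tmp, c)]   # second pass over memory


def fused(a, b, c):
    """Same expression in one pass; no intermediate array is allocated."""
    return [x * y + z for x, y, z in zip(a, b, c)]


a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
c = [0.5, 0.5, 0.5]
same = with_temporaries(a, b, c) == fused(a, b, c)
```

Both variants compute identical values. The fused form touches each element once, which is what makes it friendlier to caches and vector registers on real hardware.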
Why this is different
Manual parallelization usually starts by asking the developer to identify independent work. This works well for clean cases such as independent loop iterations that map to Parallel.For().
But many technical programs are not clean loop-parallel candidates. Loop bodies may depend on previous iterations, update shared state, or contain ordered chains of array operations. In these cases, manual loop parallelization is often unsafe or impossible.
ILNumerics works differently. It does not require globally independent loop bodies. It preserves the sequential semantics of the original program, but does not preserve unnecessary sequential program order. Parallel potential is used across large regions of the running program, while execution is decomposed to fine-grained array instruction scope instead of coarse loop-body or function scope. This earlier and finer activation of work enables efficient overlap and latency hiding. It reduces synchronization to local array dependencies and lets all other execution order dissolve into asynchronous, concurrent, and pipelined runtime behavior.
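To make the difference concrete, consider a hypothetical loop whose body cannot be handed to Parallel.For() because each iteration needs the previous value of `x`. The Python sketch below (illustrative names and loop body, not ILNumerics code) unrolls the dependency structure of such a loop at array-instruction granularity and compares the number of instructions with the length of the longest dependency chain.

```python
def unrolled_dependencies(n_iter):
    """Map each instruction to its direct producers for n_iter iterations.

    Assumed loop body (hypothetical):
        u = f(x); v = g(x); w = h(data[i]); x = u + v + w
    """
    deps = {"x0": []}
    for i in range(n_iter):
        deps[f"u{i}"] = [f"x{i}"]       # needs the previous x
        deps[f"v{i}"] = [f"x{i}"]       # needs the previous x
        deps[f"w{i}"] = []              # no loop-carried input
        deps[f"x{i + 1}"] = [f"u{i}", f"v{i}", f"w{i}"]
    return deps


def critical_path(deps):
    """Length of the longest chain of dependent instructions."""
    depth = {}
    for name, producers in deps.items():   # insertion order = program order
        depth[name] = 1 + max((depth[p] for p in producers), default=0)
    return max(depth.values())


deps = unrolled_dependencies(3)
instructions = len(deps)
longest_chain = critical_path(deps)
```

For three iterations this yields 13 instructions but a longest chain of only 7. The `w` instructions never depend on `x`, so they can run ahead of the loop-carried chain, and `u` and `v` can overlap within each iteration, even though the loop as a whole is not parallelizable at loop-body scope.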
This also enables automatic use of heterogeneous compute resources. As long as CPU execution resources are available, work may stay close to main memory. Once active instructions would otherwise wait for CPU slots, the cost model can decide that copying data to another device and computing there is faster than waiting. Heterogeneous execution emerges from runtime cost decisions, not from manual program partitioning.
The result is strong scaling not only for huge workloads, but also for mid-sized and smaller array workloads where manual parallelization would often not be worth the engineering cost. Note that in version 7 of ILNumerics, the automatic heterogeneous computing feature is experimental. The Accelerator is configured for whole-CPU computing by default.
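The kind of runtime cost decision described above can be sketched as a comparison of two estimated completion times. The Python example below is a deliberately simplified assumption: the function name, its parameters, and all throughput numbers are hypothetical, and the actual ILNumerics cost model is not documented here.

```python
def choose_device(data_bytes, flops, cpu_wait_s,
                  cpu_flops_per_s, dev_flops_per_s, link_bytes_per_s):
    """Pick the resource with the earlier estimated completion time.

    Hypothetical model: CPU cost is queueing delay plus compute time,
    device cost is transfer time plus compute time.
    """
    cpu_time = cpu_wait_s + flops / cpu_flops_per_s
    dev_time = data_bytes / link_bytes_per_s + flops / dev_flops_per_s
    return "cpu" if cpu_time <= dev_time else "device"


# Illustrative numbers only: 8 MB of input, 1e7 floating-point operations.
rates = dict(cpu_flops_per_s=1e10, dev_flops_per_s=1e11, link_bytes_per_s=1e9)
idle_cpu = choose_device(8e6, 1e7, cpu_wait_s=0.0, **rates)
busy_cpu = choose_device(8e6, 1e7, cpu_wait_s=0.02, **rates)
```

With these assumed numbers, the work stays on the CPU while a CPU slot is free, because the transfer would cost more than the faster device saves. Once the instruction would have to wait 20 ms for a CPU slot, copying the data and computing on the other device becomes the cheaper option.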
Summary
ILNumerics Accelerator achieves strong scaling for numerical .NET programs by applying automatic low-level kernel optimizations and high-level autonomous parallelization. The main thread builds a minimal dependency structure, while autonomous array instructions execute their workloads as early as correctness allows.
By removing unnecessary source-code sequencing, ILNumerics compresses sequential array execution into concurrent, pipelined, and massively parallel runtime behavior, combined with low-level optimization and automatic use of suitable compute resources.
What this means for developers
- Write NumPy- and MATLAB-style array code in .NET
- Keep readable, sequential program logic
- Avoid manual threading and task orchestration
- Avoid hardware-specific rewrites
- Benefit from low-level SIMD and cache-aware optimizations
- Expose parallelism across large program regions automatically
- Preserve full .NET compatibility and maintainability
Availability
The ILNumerics Accelerator is part of the ILNumerics Computing Engine (ILNumerics.Computing) on NuGet.org. It works transparently on common ILNumerics array code in Release mode and does not require user intervention. Optional configuration is available for advanced scenarios.
Documentation & Examples
The getting-started guides demonstrate key performance topics and provide benchmarks and comparisons.
The "Free Lunch" paper about the ILNumerics Accelerator is found here (pdf).
