Autonomous parallel execution for .NET array code

ILNumerics Accelerator changes the execution model for numerical .NET programs. Instead of starting from rigid sequential execution and searching for places where parallelism may be safe, ILNumerics starts from maximum asynchronous execution freedom.

Every array instruction is treated as independently executable by default. Correctness is then restored by one simple rule: each instruction has to respect only its direct, local data dependencies.

ILNumerics starts from maximum execution freedom and constrains it only by correctness.

The core idea

In ordinary execution, the main thread walks through the program and executes array instructions in source-code order. This creates unnecessary waiting: the thread may block on long-running operations, while other independent or partially dependent work could already be prepared or executed elsewhere.

ILNumerics changes this model. The main thread no longer has to perform the full numerical workload itself. Its main task becomes discovering upcoming array instructions, capturing their minimal dependency structure, and activating them for asynchronous execution.

The more work that leaves the main thread, the more execution time can overlap. Time that a sequential program would spend step by step is compressed into concurrent, pipelined runtime behavior.

From sequential execution to asynchronous work

The extreme form of this idea would be simple: execute all array instructions at once. Of course, that would only be correct if all dependencies were respected.

ILNumerics approaches this extreme safely. It gives array instructions as much execution freedom as possible, then restricts that freedom only where correctness requires it: at direct local data dependencies.

Instructions that do not depend on each other may execute concurrently. Instructions that are partially dependent may overlap in time as soon as the required parts of their input data become available. Instructions that are truly dependent keep the required local execution order.
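The rule above can be illustrated with a small, language-neutral sketch. Python's asyncio stands in for the runtime here; this is not ILNumerics code, and the names `instruction` and `program` are hypothetical. Each instruction is activated immediately in source order and waits only for its own direct inputs. Because asyncio runs on a single thread, the sketch demonstrates the dependency rule, not an actual parallel speedup.

```python
import asyncio

async def instruction(op, *inputs):
    # Wait only for this instruction's direct inputs, then run its workload.
    values = [await task for task in inputs]
    return op(*values)

def add(x, y):
    return [p + q for p, q in zip(x, y)]

def double(x):
    return [2 * p for p in x]

async def program():
    # The "main thread" visits instructions in source order and activates
    # each one without waiting for its result.
    start = asyncio.create_task
    a = start(instruction(lambda: [1.0, 2.0, 3.0]))
    b = start(instruction(lambda: [4.0, 5.0, 6.0]))
    c = start(instruction(add, a, b))    # depends on a and b
    d = start(instruction(double, a))    # depends on a only, not on b or c
    e = start(instruction(add, c, d))    # depends on c and d
    return await e

print(asyncio.run(program()))  # prints [7.0, 11.0, 15.0]
```

In this sketch, `d` never waits for `b` or `c`, although both appear before it in source order. Only the edges `a -> c`, `b -> c`, `a -> d`, `c -> e`, and `d -> e` constrain execution.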

Autonomous array instructions

ILNumerics array instructions are not passive operations waiting for a central scheduler. Each instruction is an active execution unit with its own decision logic.

An instruction attempts to execute its workload as early as possible. At runtime, it decides when to run, where to run, and how to run, based on its direct dependencies, available input information, data location, and available execution resources.

This decentralized model reduces the need for global synchronization. The main thread does not have to orchestrate every execution decision. It exposes the necessary dependency information, while autonomous instructions coordinate locally.

The execution net

The execution net is the moving window of array instructions that are currently active. It is built by the main thread as the program is visited in normal sequential order.

For each visited instruction, the execution net captures only the dependency structure required for correctness: which local inputs must be available before which work can proceed. It does not preserve the complete rigid source-code order when that order is not required by data dependencies.
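As a rough illustration of capturing only direct dependencies during a sequential visit, the following Python sketch records, for each instruction, which earlier instructions produced its inputs. It is a simplified model, not the ILNumerics implementation: the function names are hypothetical, and it assumes every instruction writes a fresh array, so only read-after-write dependencies need tracking.

```python
def build_dependencies(program):
    """program: list of (output_name, [input_names]) in source order.

    Returns, for each instruction index, the set of indices of the
    instructions that produce its direct inputs.
    """
    last_writer = {}  # array name -> index of the instruction that produced it
    deps = []
    for index, (output, inputs) in enumerate(program):
        deps.append({last_writer[name] for name in inputs if name in last_writer})
        last_writer[output] = index
    return deps

def earliest_steps(deps):
    """Earliest step at which each instruction could start if every
    instruction took one step and resources were unlimited."""
    steps = []
    for direct in deps:
        steps.append(1 + max((steps[i] for i in direct), default=-1))
    return steps

# Five instructions, written in source order.
program = [
    ("A", []),          # 0: A = generator
    ("B", []),          # 1: B = generator
    ("C", ["A", "B"]),  # 2: C = A + B
    ("D", ["A"]),       # 3: D = sin(A)
    ("E", ["C", "D"]),  # 4: E = C * D
]
deps = build_dependencies(program)
print(deps)                  # [set(), set(), {0, 1}, {0}, {2, 3}]
print(earliest_steps(deps))  # [0, 0, 1, 1, 2]
```

Five sequential steps collapse to three in this model, because the source order between instructions 0 and 1, and between 2 and 3, is not required by any data dependency.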

At any time, the execution net may contain hundreds or thousands of active instructions. Within this window, instructions can execute, overlap, and move to suitable compute resources as soon as their local dependencies permit.

The main thread builds the minimum dependency structure; autonomous instructions turn it into parallel execution.

Array pipelining

Dependent instructions do not always need to wait for complete upstream results. Often, partial information is enough to begin useful work.

For example, the output size of a segment can often be determined from the shapes of its input arrays. Device selection may only require information about input shape and data location in order to estimate transfer and compute costs. These decisions can start before all input values have been fully produced.
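The shape example can be made concrete with a short Python sketch. It is illustrative only: the helper names are hypothetical, and keeping a reduced dimension with length 1 is an assumed convention. The point is that the shape of a downstream result follows from input shapes alone, before any element value exists.

```python
def matmul_shape(a, b):
    """Output shape of a matrix product, computed from input shapes only."""
    if len(a) != 2 or len(b) != 2 or a[1] != b[0]:
        raise ValueError(f"incompatible shapes {a} and {b}")
    return (a[0], b[1])

def sum_shape(shape, axis):
    """Output shape of a sum along one axis.

    Assumption: the reduced dimension is kept with length 1.
    """
    return tuple(1 if i == axis else n for i, n in enumerate(shape))

# No array values have been computed at this point, only shapes.
product = matmul_shape((1000, 200), (200, 50))
column_sums = sum_shape(product, axis=0)
print(product)      # (1000, 50)
print(column_sums)  # (1, 50)
```

With `column_sums` known to be 1 x 50, a downstream instruction could already size its output buffer and estimate its cost while the matrix product is still running.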

By starting downstream preparation and execution as soon as partial input information is available, ILNumerics hides latency and overlaps dependent operations. This is similar in spirit to instruction-level parallelism in a processor pipeline, but adapted to multidimensional array data and whole program regions.

Micro-JIT optimization

Autonomous execution decides when and where work should run. Micro-JIT optimization remains necessary to make that work efficient at the selected execution granularity.

Once an instruction commits work to an execution resource, ILNumerics adapts the operation to the low-level characteristics of that target. This includes instruction fusion, temporary elimination, cache-aware execution, SIMD and vector-register-aware kernels, loop unrolling, constant folding, and optimized numerical primitives.
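Two of the listed techniques, instruction fusion and temporary elimination, can be shown in miniature. The Python sketch below is a conceptual illustration with hypothetical function names, not generated ILNumerics code: it evaluates sqrt(a * b + c) element by element, first with one pass and one temporary per operation, then as a single fused pass.

```python
import math

def unfused(a, b, c):
    # Three separate passes over the data, with two temporary arrays.
    t1 = [x * y for x, y in zip(a, b)]
    t2 = [x + z for x, z in zip(t1, c)]
    return [math.sqrt(v) for v in t2]

def fused(a, b, c):
    # One pass, no temporaries: each element is loaded once,
    # fully processed, and stored once.
    return [math.sqrt(x * y + z) for x, y, z in zip(a, b, c)]

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
c = [5.0, 6.0, 7.0]
print(unfused(a, b, c))  # [3.0, 4.0, 5.0]
print(fused(a, b, c))    # [3.0, 4.0, 5.0]
```

Both versions return the same values. The fused form touches memory less often, which is where the benefit of fusion typically comes from on real hardware.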

This combines two layers of performance: asynchronous execution at the program level and efficient execution within each device.

Why this is different

Manual parallelization usually starts by asking the developer to identify independent work. This works well for clean cases such as independent loop iterations that map to Parallel.For().

But many technical programs are not clean loop-parallel candidates. Loop bodies may depend on previous iterations, update shared state, or contain ordered chains of array operations. In these cases, manual loop parallelization is often unsafe or impossible.

ILNumerics works differently. It does not require globally independent loop bodies. It reduces synchronization to local array dependencies and lets all other execution order dissolve into asynchronous, concurrent, and pipelined runtime behavior.

This also enables automatic device crossing. As long as CPU execution resources are available, work may stay close to main memory. Once active instructions would otherwise wait for CPU slots, the cost model can decide that copying data to another device and computing there is faster than waiting. Heterogeneous execution emerges from runtime cost decisions, not from manual program partitioning.
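The kind of decision described above can be sketched as a comparison of two time estimates. The Python function below is a deliberately simplified, hypothetical model: the parameter names and numbers are assumptions, the cost of copying results back is ignored, and the actual ILNumerics cost model is not documented here.

```python
def choose_device(n_bytes, cpu_wait_s, cpu_compute_s,
                  other_compute_s, transfer_bytes_per_s):
    """Pick the device with the lower estimated completion time.

    All inputs are estimates; this is an illustrative model only.
    """
    cpu_total = cpu_wait_s + cpu_compute_s
    other_total = n_bytes / transfer_bytes_per_s + other_compute_s
    return "cpu" if cpu_total <= other_total else "other"

# CPU slot free: staying close to main memory wins (1.0 ms vs 1.5 ms).
print(choose_device(8e6, 0.0, 0.001, 0.0005, 8e9))    # cpu

# CPU slots busy for 2 ms: copying and computing elsewhere wins (3.0 ms vs 1.5 ms).
print(choose_device(8e6, 0.002, 0.001, 0.0005, 8e9))  # other
```

The same instruction lands on different devices depending only on the current wait time, which is the sense in which heterogeneous execution follows from runtime cost rather than from a partitioning written by the developer.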

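The result is strong scaling, that is, shorter run times for a fixed problem size as more compute resources are used. This holds not only for huge workloads, but also for mid-sized and smaller array workloads where manual parallelization would often not be worth the engineering cost.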

Summary

ILNumerics Accelerator reduces the time the main thread spends executing and waiting. The main thread builds a minimal dependency structure, while autonomous array instructions execute their workloads as early as correctness allows.

By removing unnecessary source-code sequencing, ILNumerics compresses sequential array execution into concurrent, pipelined, and massively parallel runtime behavior, combined with low-level optimization and automatic use of suitable compute resources.

Availability

ILNumerics Accelerator is available as part of ILNumerics.Computing on NuGet. It works transparently on common ILNumerics array code in Release mode and does not require user intervention. Optional configuration is available for advanced scenarios.

The current implementation focuses on fundamental array instructions such as unary, binary, reduction, and generator operations, as well as FFT and linear algebra on the CPU. The long-term goal is to extend autonomous execution across increasingly large regions of ILNumerics array code, allowing technical applications to use modern parallel hardware automatically while preserving the clarity of high-level numerical programming.

To get started, see the introduction and getting-started guides. They cover installation, elementary expressions, k-means acceleration, FFT array pipelining, a comparison with traditional multithreading, configuration options, supported features, and roadmap details.