ILNumerics Accelerator: Getting Started, Part IV
Array Pipelining vers. Multithreading
ILNumerics Accelerator enables efficient execution of common array codes on heterogeneous hardware. Array pipelining is one power tool to its efficiency. In the last article of our getting started guides we have shown how the high-level execution of array pipelining allows to more efficiently execute FFT calculations. But array pipelining brings even more advantages. Let's find out and compare array pipelining to traditional multithreading techniques! We will define a simple algorithm and compare its performance when run by traditional threading with array pipelining.
In the end we'll provide a comprehensive list of advantages of array pipelining.
The traditional Approach: Parallel.For Loops
Since programmers are in need of writing concurrent programs correctly, many technologies came to light with the aim to help them succeed. While all such technologies simplify the distribution of workload to other tasks, threads, or processes they all increase complexity for the programmer due to additional requirements on the algorithm and the workload. The programmer must ensure, that what formerly executed sequentially will now bring the correct result even when parts of the program run concurrently. Because this task is demanding enough parallelization is often left to performance experts.
One popular approach on the .NET platform is to use the Parallel.For method of the Task Parallel Library (TPL). It executes a certain action on multiple threads concurrently. The TPL handles all necessary overhear, like: thread resource management, synchronization and error collection and reporting. However, the programmer must ensure, that all objects and code which are shared between the threads are thread safe, that the original intention of the program is retained even when executed in parallel, so that such parallel execution does not lead to unwanted side effects or gives incorrect results, and that performance goals are met.
Here is our example algorithm:
This simple loop iterates over the columns of a matrix A. It calculates the sine values of elements of A from a region including all columns starting with the first column and ending with the current column. The results from all columns are added (sum along dimension: 1), the absolute value is calculated and the result is stored into another array.
Let's see how this performs as sequential code first! We will use the same setup as before: Visual Studio 2022, .NET 8.0, ILNumerics.Computing 7.0. Add the following code to a new Console Application, using the nuget package ILNumerics.Computing. The full project can also be downloaded from here.
After compiling in Release mode and when running without debugger attached, it gives something like this:
Sequentially executing this code took ~155ms on our test computer.
Now, we'll add a parallelized version. We do the paralellization manually, using Parallel.For. Our simple approach is to replace the full iteration over the second dimension with multiple small iterations which run in parallel. The new code:
We have extended our run() function with the parallel version, using Parallel.For. The body / the workload is defined as an anonymous function, which is a very common approach. This body is executed at runtime on multiple threads in parallel.
The body of the anonymous function is exactly the same as in the sequential version. Note, how the array variables A and A_Parfor are captured in the lambda: the same single instance is shared and used concurrently by all threads running in parallel. A is used for concurrent reading. A_Parfor is used for concurrent writing. If you wonder, how such sharing of complex objects like Array<T> is possible to start with: ILNumerics is thread safe since version 7.0.163.)
The run() function now returns the measured times for both loops in milliseconds. In the rest of the code we call this function multiple times to let the system stabilize. In the end, each measurement is displayed – but not without validating correctness! The check() function receives the accumulated error for a measurement and decides between 'correct' and 'error' based on a threshold. In an error situation the value of the maximum difference to the correct result is displayed:
Obviously, we have some good news: the parallel version completed without crashing the program. It also runs significantly faster: ~7 times speed-up due to efficient multithreading. This is even true for such invocations, which produced the correct result. But, unfortunately, only 6 out of 20 runs did produce the correct result. And this is obviouosly not very useful...
A closer Look
As a performance expert you already know the reason for this failure. Each invocation of the array expression overwrites a certain region of the result array A_Parfor. The first iteration overwrites the first column. The last iteration overwrites the whole array. This algorithm depends on the order of writes to the resulting array. Parallel.For, however, executes the action body in arbitrary order. All common parallelization techniques today do.
Our first, naïve attempt for parallelization will not work! There are ways to work around, though: we could collect all results into individual arrays and merge them afterwards, in sequential order. We could iterate in parallel by manually splitting the first dimension into smaller chunks and compute all iterations of those chunks in parallel. Or, we could use our intrinsic knowledge about the algorithm and skip all but the last iteration results. However, instead of approaching this common, demanding decision making and often painfull re-implementing we'll use a better solution instead. A solution which not only solves this problem but all problems related to parallel, out-of-order execution and thread safety.
The modern Approach: Array Pipelining
We'll keep the failed Parallel.For approach in the example code for a while, since it gives a good hint, which rough execution times we could expect after fixing it. And we add a new version of execution for the ILNumerics Accelerator.
The original array expression was duplicated and the new section was marked with "//ILN(enabled = true)". Display- and check functions have been adopted accordingly. No other changes are required to invoke the optimizations:
We display the timings for each measurement run in three columns: sequential run, Parallel.For and the Accelerator version:
ILNumerics Accelerator more efficiently distributes the workload for the calculations, hence runs with the same or even faster speed as Parallel.For. However, it does not change the order of execution of the array instructions! The order of writes to the shared result variable A_Acc is the same as for the sequential version. Thus, the result is guaranteed to be correct for all invocations.
Array Pipelining Advantages
Throughout the last two articles we have demonstrated multiple advantages array pipelining brings. We can summarize them as follows:
- Parallel potential is automatically identified and made use of.
- Resulting parallelization is more efficient – regardless of data size.
- In-order execution keeps the original, relative order of array instructions. 'Relative' means: instructions which depend on each other or access the same data run sequentially. All other instructions run concurrently.
- Second-order dependencies (i.e.: dependencies of dependencies) run concurrently.
If this does not yet sound convincing enough for you, we can add to this list:
- Latency hiding: approaching parallel hardware almost always comes with additional effort. Memory must be provided to the hardware, instructions need to be re-written. Due to its unique asynchronous execution technique array pipelining is often able to hide this latency completely. This enables us to approach not only low-latency parallel techniques (as multithreading or SIMD vector parallelization) but also GPU devices and other accelerator hardware! (This feature is already contained in the current version but remains in an experimental state for the first release. Stay tuned for more details!)
- No thread safety requirements: code to be accelerated by ILNumerics Accelerator is not required to be thread safe! It is only the workload which runs in parallel. And this part is transparent to the user. User's code, however, will run only once, sequentially and in the same order as it was written in. Likewise, objects shared in accelerated code are not accessed by multiple threads (unless the program explicitly does), hence are not required to be thread safe. For example, consider a shared counter variable i in the Parallel.For version above: safely incrementing i inside the parallel loop would require us to increment atomically, by: Interlocked.Increment(ref i). ILNumerics accelerated code does not require such precautions. A regular i++ will do.
- Applicable for arbitrary code: array pipelining optimizations are whole program optimizations. They are applied to any code consisting out of array instructions, regardless if they are defined inside or outside of loop- or other code blocks. They even work for code spanning multiple functions or multiple assemblies, without the need for user interaction.
- Applicable for arbitrary granularity: by accumulating larger workloads (ILNumerics: segments = multiple array instructions) we can automatically include multiple, heterogeneous compute resources (CPUs, GPUs and even multiple nodes). And by breaking down individual array instructions into sub-tasks we can apply array pipelining to these sub-tasks. This way, even dependent array instructions start executing in parallel!
Conclusion
Array pipelining executes independent instructions in parallel and retains sequential semantics for dependent instructions. Often, all parallel potential within an algorithm is identified and made use of. This not only brings much better execution times. The scheme is also applicable to any part of an array algorithm, beyond "embarrassingly parallel" loop bodies. All important decisions about when which device executes which workload are made autonomously at runtime – without user intervention. Code automatically adopts to any hardware found.