Getting Started III - Pros & Cons

This information is outdated. The problems and issues described here were resolved in version 7.0.104 (Accelerator beta02). The Accelerator compiler is now 'On' by default. It handles all code regions and produces correct results. It never slows things down, but delivers a great speed-up for supported array expressions.

~~One might wonder why the Accelerator needs to be explicitly enabled for source code regions. This article should make it obvious.~~

A Simple Expression

Let's consider the following, very simple algorithm. It creates a matrix of size [1000 x 2000]. Each column is constant, while the values 0 and 1 alternate along each row. One could attempt to create this matrix with the following iterative approach:

Here, in each iteration the previous column of the matrix is extracted. Its values are negated and 1.0 is added to the result. The result is then stored back into the same array, at the column corresponding to the current iteration. All this is executed in a loop, iterating over the columns of the matrix.
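As a language-neutral illustration of the loop just described, here is a minimal sketch in plain Python. It is not ILNumerics code and makes two assumptions that the text does not state: the matrix starts out filled with zeros, and it is stored as a list of columns so that a whole column can be read and written at once. The name `build_iteratively` is hypothetical.

```python
ROWS, COLS = 1000, 2000

def build_iteratively(rows=ROWS, cols=COLS):
    """Build the matrix column by column; A is stored as a list of columns."""
    # Assumption: start with an all-zero matrix (cols columns of rows values).
    A = [[0.0] * rows for _ in range(cols)]
    # Column i is derived from column i - 1, so the body runs cols - 1 times.
    for i in range(1, cols):
        previous = A[i - 1]                  # read the previous column
        A[i] = [-v + 1.0 for v in previous]  # negate, add 1.0, store at column i
    return A

A = build_iteratively()
```

With 2000 columns the loop body runs 1999 times, and every iteration needs the column written by the iteration before it.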

When enabled for the loop, the ILNumerics Accelerator creates a single segment out of the right side of line 4. It performs a negation and an addition. This segment will be called for each column of A, 1999 times. Now, since each iteration reads from A and stores into A, all invocations of the segment will be serialized: the second segment invocation will not start before the first segment invocation has completed, and so forth.

The main purpose of this serialization, of course, is to respect data dependencies between individual loop iterations. Because of this precaution, it is safe to enable the Accelerator even for loops that read from parts of the array which were stored in another loop iteration. However, while this is safe, it is not especially fast!

Attempting to accelerate this loop will currently fail. In fact, the result will run much slower than the original version. The cause is that there is no parallel potential within this algorithm. Still, the Accelerator will attempt to execute all small segments asynchronously, to give them the chance to spread out to multiple cores. Unfortunately, they cannot. So, the latency of asynchronous execution is not hidden by subsequent segment invocations.
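The effect can be pictured with a small analogy in plain Python. This is only a sketch under stated assumptions: `ThreadPoolExecutor` stands in for asynchronous segment execution and says nothing about how the Accelerator works internally, and `segment` and `run_dependent_segments` are hypothetical names. Each task is handed to the pool, but because its input is the previous task's output, the result must be awaited before the next task can even be submitted.

```python
from concurrent.futures import ThreadPoolExecutor

def segment(previous_column):
    """Hypothetical stand-in for one segment: negate a column and add 1.0."""
    return [-v + 1.0 for v in previous_column]

def run_dependent_segments(rows, cols, max_workers):
    """Dispatch every segment asynchronously, although each needs its predecessor."""
    columns = [[0.0] * rows]  # assumption: the first column holds zeros
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for _ in range(1, cols):
            # The input of this segment is the output of the previous one, so
            # its result has to be awaited before the next submit can happen.
            future = pool.submit(segment, columns[-1])
            columns.append(future.result())
    return columns

# Small sizes keep the sketch quick; extra workers do not change the outcome.
chained = run_dependent_segments(rows=4, cols=6, max_workers=4)
serial = run_dependent_segments(rows=4, cols=6, max_workers=1)
```

Both calls produce the same columns in the same order. With four workers, no two segments ever overlap, so every invocation pays the cost of the hand-off to another thread without gaining any parallelism. A single worker lines the segments up one after another, which is what the dependency chain enforces in any case.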

One could go ahead and improve the situation slightly. By limiting the maximum number of worker threads used for execution, all segments are lined up sequentially (which is required anyway). Now we are back on common ground, performance-wise. Still, the speed will likely not be convincing either.

A Faster Approach

~~A faster way to create the same matrix, of course, would be:~~

We have not only removed the loop; there is also no data dependency in this expression. Hence, all columns can be computed independently. Attempting to accelerate this expression will indeed yield a great speed-up (12x on our machine) compared to the non-accelerated code.
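To illustrate what "no data dependency" means here, the following plain Python sketch computes every column from its own index alone. It is an analogy, not the ILNumerics expression: the names `column` and `build_independently` are hypothetical, the matrix is stored as a list of columns, and column 0 is assumed to hold zeros. Because no column reads another column, the work can be handed out in any order.

```python
from concurrent.futures import ThreadPoolExecutor

def column(index, rows):
    """Column `index` depends only on its own index: 0.0 if even, 1.0 if odd."""
    return [float(index % 2)] * rows

def build_independently(rows, cols, max_workers=4):
    """Compute all columns without reading any other column."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves the input order, so column i ends up at position i.
        return list(pool.map(lambda i: column(i, rows), range(cols)))

B = build_independently(rows=1000, cols=2000)
```

The sketch shows the structure of the computation only. Python threads will not reproduce the speed-up reported above. The point is that a scheduler is free to run these tasks concurrently, which it is not when each column depends on the previous one.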

Takeaway

~~While the above situation could certainly be improved, the Accelerator currently does not handle it too well.~~

~~Currently, the Accelerator focusses on speeding-up common bottlenecks in common algorithms. It brings great results for array expressions, where at least one of the following is true:~~

~~Expression involves some 'complexity'. For example: at least one reduction operation (sum(), max(), etc.).~~
~~Expression involves generator operations (ones(), zeros()), creating large, constant arrays.~~
~~Expression creates large temporary arrays.~~
~~Expression performs broadcasting, increasing array sizes.~~
~~Expression is executed with array input parameter(s) of significant size.~~
~~Expression exposes some parallel potential.~~

Support for small and 'cheap' expressions will improve over time. This is mainly a matter of identifying common patterns and implementing heuristics for them. Until we can be reasonably sure that all your code profits from it, the Accelerator must be enabled for individual code regions.