On Not Optimising for Last Century's Hardware

Once upon a time I wrote a super-optimised algorithm for rotating data in an array. At least, it was meant to be super-optimised, but its real-world performance turned out to be terrible. That’s because my intuition about performance was stuck in the 20th century:

Break a program down into basic operations like multiplication and assignment
Give each operation a cost (or just slap on an O(1) if you’re lazy)
Add up all the numbers
Try to make the total in step #3 small

A lot of textbooks still teach this “classic” thinking, but except for some highly constrained embedded systems, it just doesn’t work that well on modern hardware.

The original version of this algorithm was in C, but I’m going to demonstrate my failure using D to keep the code simpler.

The Problem

We need a routine that can rotate an array. A rotation by 1 shifts all elements across one place, with the element that “falls off” the end being put back onto the other end. A rotation by $d$ is like doing the same thing $d$ times, but we’d like to implement it as one transformation if possible. I’m defining rotations to be to the “left”, so that a rotation by $d$ puts the original (0-based) element number $d$ at the front.

Note that the shift distance can be larger than the array itself. If the array has length $N$ , then every shift by a full $N$ gets us back to the original array, so we can simplify large shifts by just dividing by $N$ and taking the remainder (using the % operator in D).

Baseline Solution: Allocate and Copy

Rotating the array is the same as splitting the array at a point and swapping the two pieces. (If you’re not sure about that, grab some playing cards and try it out — seriously, it’s a great way to grok any array transformation.) One simple way to implement that is to allocate some temporary working space, copy the data into it in the right order, then copy the data back into the original buffer. Here’s a no-brains translation of that into code:

void rotateAllocCopy(int[] buffer, size_t delta)
{
        if (buffer.empty) return;
        delta = delta % buffer.length;
        auto temp = new int[buffer.length];
        auto temp_remainder = copy(buffer[delta..$], temp);
        copy(buffer[0..delta], temp_remainder);
        temp.copy(buffer);
}

I’m deliberately pessimising the memory management here by allocating a new temporary buffer that’s only freed on garbage collection. If I did the more natural thing and freed the buffer at the end of the function, and then benchmarked rotations in a tight loop, I could expect the same temporary buffer to just bounce on and off the underlying malloc implementation’s free list, which wouldn’t be giving my “optimised” solution a fair chance to shine.

The “Optimised” Solution

There are two obvious targets for speeding up the baseline solution: the heap memory allocation and the wasteful copying. We need at least some temporary working space because otherwise we can’t move any data around without clobbering other data. But all we need is space for one element. This algorithm is theoretically interesting, but practically really bad, so I’ll describe it in full, but skip the details if you like.

Say we start at the first element. We copy that into the working space because there’s nowhere else it can go, which leaves the first element of the array free to be overwritten. There’s some element that’s meant to end up there, so we might as well shift it over now, leaving space for another element to move into its place. In this way, we can keep working around the array, jumping by $d$ places at each step, possibly wrapping around the end of the array a few times. Eventually we’ll copy the original element from temporary storage into its proper place.

Are we done at this point? Not quite. If we use a bit of number theory, we can show that the above hopping around will create a cycle containing $N/\gcd(N, d)$ elements (where $\gcd$ is the greatest common divisor), so we need to run $\gcd(N,d)$ cycles starting at different appropriate locations. Conveniently, the same number theory proves that once we get back to where we started, the next element along in the array is a starting point for the next loop.

Minimal copy rotation of an array — Rotating a 10-element array by 4 as much as possible using one "wasted" copy. This attempt manages to put all the even-indexed elements into the right place. Another cycle is needed for the odd-indexed elements.

If the array length is a power of two, we can replace all the more complex calculations with super fast bit manipulations. Since this is meant to be a super fast algorithm, let’s make that restriction.

void rotateMinimalCopy(int[] buffer, size_t delta)
{
        if (buffer.empty) return;

        // Look, Ma, no heap allocation!

        auto len = buffer.length;
        // This code only works if the buffer length is a power of two (or empty)
        // Check that using a bit manipulation trick
        enforce (((len - 1) & len) == 0);

        // If len is a power of two, this is the same as delta % len
        delta = delta & (len - 1);

        if (delta == 0) return;
        // The buffer length is a power of two, so the gcd of it and anything must also be a power of two
        // Because delta < len, gcd(delta, len) is the lowest set bit of delta
        auto num_loops = delta & ~(delta - 1);

        // With all these bit tricks, this code better be really fast

        foreach (start; 0..num_loops)
        {
                auto t = buffer[start];
                auto idx = start, next_idx = (start + delta) & (len - 1);
                do
                {
                        buffer[idx] = buffer[next_idx];
                        idx = next_idx;
                        next_idx = (idx + delta) & (len - 1);
                } while (next_idx != start);
                buffer[idx] = t;
        }
}

This is obviously a lot more complicated and less “stupid” than the first version. How does it perform? Before looking at the benchmarks, let’s look at one more algorithm.

An Elegant Solution

After writing that algorithm, I came across another approach. I mentioned earlier that rotating an array is the same as splitting the array and swapping the pieces. Reversing the array would also swap the order of the pieces, but of course the elements inside the pieces would be in reverse order, too. That’s easily fixed by doing extra reversals of the pieces themselves. Reversing can be done in place, so no allocation is required.

void rotateReverse(int[] buffer, size_t delta)
{
        if (buffer.empty) return;
        delta = delta % buffer.length;
        reverse(buffer[0..delta]);
        reverse(buffer[delta..$]);
        reverse(buffer);
}

It’s even simpler than the first version. When I first saw this algorithm, I thought it was cute but obviously much less efficient than the minimal copy approach (it does even more copying than the allocate-and-copy approach). Thankfully, I tried it out.

The Benchmarks

I’ve reproduced some benchmarks using the D code above:

Rotation algorithm benchmarks — Time to rotate arrays of lengths between 1 element and 134,217,728 (128Mi) elements for three different algorithms. Note the log/log scale; a small gap between lines represents a significant difference.

The rotation by reversal algorithm has rock-solid linear performance above about 64 elements. Under ~1Ki elements, the allocate-and-copy algorithm time is dominated by the allocation. Its performance wobbles around linear growth but eventually is about twice as slow as the reversal algorithm. The minimal copy algorithm has similar performance to the reversal algorithm above a few hundred elements, but above 2Mi elements it diverges and becomes ten times as slow. It’s never faster than the reversal algorithm.

Please don’t take these benchmarks too seriously (remember, the allocate-and-copy implementation was really stupidly coded). The point is that the super “optimised” algorithm isn’t better than two algorithms that are really easy to implement and verify. In fact, it’s an order of magnitude slower for large arrays.

The Postmortem

The simple reason the “optimisations” didn’t make the code faster is that it’s bad at a higher level than I was optimising. Bit operations won’t make bogosort faster than quicksort. As I said in the beginning, my thinking was stuck in the 20th century. The “classic” approach of adding up total operation cost works for a machine that crunches operations one at a time, but most computers haven’t worked like that in decades. I’m not just talking about multi-core processors. Pipelines, memory caching, prefetching, out-of-order processing, microcode architectures, write buffering, SIMD, etc, all add a kind of parallelism to modern computers. But this parallelism needs smooth, predictable program flow to work efficiently.

Computer performance has pretty much done a 180 turn. After Big O, the big thing in performance was maximising entropy. An if statement that’s almost always true is almost redundant, so making everything as unpredictable as possible minimises redundancy and makes every operation count. That was the 20th century. Now we need to make operations as predictable as possible, to maximise the amount of work the CPU can do.

The two good algorithms process memory in very simple patterns (mostly straight sweeps). The bad algorithm jumps all over the place.

To get a more concrete picture of what the performance problems are, we need to use a profiler designed for modern optimisation. This will give us counts of “events” that disrupt the CPU’s processing of the code. The easiest tool for Linux-based systems is perf. This will dump a bunch of performance statistics for the rotate benchmark program:

$ perf stat ./rotate

There are more statistics available. To list them:

$ perf list

Then you can get the specific stats with something like this:

perf stat -e cache-misses,page-faults ./rotate

(perf can do a lot more sophisticated things by the way.)

Cache misses and page faults are the most interesting stats here, which you might suspect from experience, or just guess by looking at all the stats. Because of its irregular memory access patterns, the minimal copy algorithm suffers from a lot more memory cache misses than the other two. The allocate-and-copy algorithm suffers more cache misses than the reversal algorithm, as well as more page faults. This is probably because it touches more “cold” memory that hasn’t been used recently — because of both allocating new memory and triggering garbage collections — and touches more memory overall.

Cache miss and page fault counts per rotation algorithm for arrays of 1Mi (1048576) integers. The results have been averaged over ten iterations.
Algorithm	Cache Misses	Page Faults
Reversal	3,340,210	1653
Allocate and Copy	5,061,042	18,046
Minimal Copy	25,839,824	1653

Source Code

If you want to try it out for yourself (there are several ways to make the allocate-and-copy algorithm faster) here’s the full code listing. I compiled the code for the benchmarks with LDC, the LLVM-backed D compiler:

$ ldc2 -O2 -boundscheck=off rotate.d
$ ./rotate --help
$ # rotate a 1024 element array by 99, five times, using reversal
$ ./rotate --size 10 --delta 99 --iterations 5 --algorithm reverse

The default output is the average time in microseconds.

// Array rotation algorithm example benchmark
// https://theartofmachinery.com/2016/11/20/optimising_for_modern_hardware.html
//
// Licence:
// https://opensource.org/licenses/MIT

import std.algorithm : copy, reverse;
import std.array : array;
import std.datetime : StopWatch;
import std.exception : enforce;
import std.getopt : defaultGetoptPrinter, getopt, GetOptException;
import std.range : iota;
import io = std.stdio;


void rotateAllocCopy(int[] buffer, size_t delta)
{
        if (buffer.empty) return;
        delta = delta % buffer.length;
        auto temp = new int[buffer.length];
        auto temp_remainder = copy(buffer[delta..$], temp);
        copy(buffer[0..delta], temp_remainder);
        temp.copy(buffer);
}

void rotateReverse(int[] buffer, size_t delta)
{
        if (buffer.empty) return;
        delta = delta % buffer.length;
        reverse(buffer[0..delta]);
        reverse(buffer[delta..$]);
        reverse(buffer);
}

void rotateMinimalCopy(int[] buffer, size_t delta)
{
        if (buffer.empty) return;

        // Look, Ma, no heap allocation!

        auto len = buffer.length;
        // This code only works if the buffer length is a power of two (or empty)
        // Check that using a bit manipulation trick
        enforce (((len - 1) & len) == 0);

        // If len is a power of two, this is the same as delta % len
        delta = delta & (len - 1);

        if (delta == 0) return;
        // The buffer length is a power of two, so the gcd of it and anything must also be a power of two
        // Because delta < len, gcd(delta, len) is the lowest set bit of delta
        auto num_loops = delta & ~(delta - 1);

        // With all these bit tricks, this code better be really fast

        foreach (start; 0..num_loops)
        {
                auto t = buffer[start];
                auto idx = start, next_idx = (start + delta) & (len - 1);
                do
                {
                        buffer[idx] = buffer[next_idx];
                        idx = next_idx;
                        next_idx = (idx + delta) & (len - 1);
                } while (next_idx != start);
                buffer[idx] = t;
        }
}

int main(string[] args)
{
        size_t iterations = 100;
        uint log2_buffer_size = 20;
        size_t delta = (1<<log2_buffer_size) / 3;
        string algorithm = "reverse";
        bool show_timing = true;
        bool show_result = false;
        auto rotate = &rotateReverse;

        void selectAlgorithm(string option, string algorithm)
        {
                switch (algorithm)
                {
                        case "reverse":
                                rotate = &rotateReverse;
                                break;
                        case "mincopy":
                                rotate = &rotateMinimalCopy;
                                break;
                        case "alloccopy":
                                rotate = &rotateAllocCopy;
                                break;
                        default:
                                throw new GetOptException(`Algorithm options are "reverse", "mincopy" and "alloccopy".`);
                }
        }

        try
        {
                auto getopt_result = getopt(
                                args,
                                "iterations|i", "Number of iterations to run benchmark", &iterations,
                                "size|s",  "Log base 2 of array length", &log2_buffer_size,
                                "delta|d", "Number of elements to shift array by", &delta,
                                "algorithm|a", `One of "reverse", "mincopy" or "alloccopy"`, &selectAlgorithm,
                                "show-timing", "Show average iteration timing in microseconds", &show_timing,
                                "show-result", "Dump final array contents", &show_result,
                );
                if (getopt_result.helpWanted)
                {
                        defaultGetoptPrinter("Array rotation algorithm example benchmark", getopt_result.options);
                        return 0;
                }
        }
        catch (Exception ex)
        {
                io.stderr.writeln(ex.msg);
                io.stderr.writef("Run\n%s --help\n for usage\n", args[0]);
                return 1;
        }

        auto buffer = iota(0, 1<<log2_buffer_size).array();

        StopWatch timer;
        timer.start();
        foreach (_; 0..iterations)
        {
                rotate(buffer, delta);
        }
        timer.stop();

        if (show_timing)
        {
                auto total_nsecs = timer.peek.nsecs;
                auto average_us = cast(real)total_nsecs / iterations / 1000;
                io.writef("%10.2f\n", average_us);
        }

        if (show_result)
        {
                io.writef("%(%9d\n%)\n", buffer);
        }
        return 0;
}