Once upon a time I wrote a super-optimised algorithm for rotating data in an array. At least, it was meant to be super-optimised, but its real-world performance turned out to be terrible. That’s because my intuition about performance was stuck in the 20th century:

1. Break a program down into basic operations like multiplication and assignment
2. Give each operation a cost (or just slap on an O(1) if you’re lazy)
3. Add up all the numbers
4. Try to make the total in step #3 small

A lot of textbooks still teach this “classic” thinking, but except for some highly constrained embedded systems, it just doesn’t work that well on modern hardware.

The original version of this algorithm was in C, but I’m going to demonstrate my failure using D to keep the code simpler.

## The Problem

We need a routine that can rotate an array. A rotation by 1 shifts all elements across one place, with the element that “falls off” the end being put back onto the other end. A rotation by $d$ is like doing the same thing $d$ times, but we’d like to implement it as one transformation if possible. I’m defining rotations to be to the “left”, so that a rotation by $d$ puts the original (0-based) element number $d$ at the front.

Note that the shift distance can be larger than the array itself. If the array has length $N$, then every shift by a full $N$ gets us back to the original array, so we can simplify large shifts by just dividing by $N$ and taking the remainder (using the % operator in D).

## Baseline Solution: Allocate and Copy

Rotating the array is the same as splitting the array at a point and swapping the two pieces. (If you’re not sure about that, grab some playing cards and try it out — seriously, it’s a great way to grok any array transformation.) One simple way to implement that is to allocate some temporary working space, copy the data into it in the right order, then copy the data back into the original buffer. Here’s a no-brains translation of that into code:

I’m deliberately pessimising the memory management here by allocating a new temporary buffer that’s only freed on garbage collection. If I did the more natural thing and freed the buffer at the end of the function, and then benchmarked rotations in a tight loop, I could expect the same temporary buffer to just bounce on and off the underlying malloc implementation’s free list, which wouldn’t be giving my “optimised” solution a fair chance to shine.

## The “Optimised” Solution

There are two obvious targets for speeding up the baseline solution: the heap memory allocation and the wasteful copying. We need at least some temporary working space because otherwise we can’t move any data around without clobbering other data. But all we need is space for one element. This algorithm is theoretically interesting, but practically really bad, so I’ll describe it in full, but skip the details if you like.

Say we start at the first element. We copy that into the working space because there’s nowhere else it can go, which leaves the first element of the array free to be overwritten. There’s some element that’s meant to end up there, so we might as well shift it over now, leaving space for another element to move into its place. In this way, we can keep working around the array, jumping by $d$ places at each step, possibly wrapping around the end of the array a few times. Eventually we’ll copy the original element from temporary storage into its proper place.

Are we done at this point? Not quite. If we use a bit of number theory, we can show that the above hopping around will create a cycle containing $N/\gcd(N, d)$ elements (where $\gcd$ is the greatest common divisor), so we need to run $\gcd(N,d)$ cycles starting at different appropriate locations. Conveniently, the same number theory proves that once we get back to where we started, the next element along in the array is a starting point for the next loop.

If the array length is a power of two, we can replace all the more complex calculations with super fast bit manipulations. Since this is meant to be a super fast algorithm, let’s make that restriction.

This is obviously a lot more complicated and less “stupid” than the first version. How does it perform? Before looking at the benchmarks, let’s look at one more algorithm.

## An Elegant Solution

After writing that algorithm, I came across another approach. I mentioned earlier that rotating an array is the same as splitting the array and swapping the pieces. Reversing the array would also swap the order of the pieces, but of course the elements inside the pieces would be in reverse order, too. That’s easily fixed by doing extra reversals of the pieces themselves. Reversing can be done in place, so no allocation is required.

It’s even simpler than the first version. When I first saw this algorithm, I thought it was cute but obviously much less efficient than the minimal copy approach (it does even more copying than the allocate-and-copy approach). Thankfully, I tried it out.

## The Benchmarks

I’ve reproduced some benchmarks using the D code above:

The rotation by reversal algorithm has rock-solid linear performance above about 64 elements. Under ~1Ki elements, the allocate-and-copy algorithm time is dominated by the allocation. Its performance wobbles around linear growth but eventually is about twice as slow as the reversal algorithm. The minimal copy algorithm has similar performance to the reversal algorithm above a few hundred elements, but above 2Mi elements it diverges and becomes ten times as slow. It’s never faster than the reversal algorithm.

Please don’t take these benchmarks too seriously (remember, the allocate-and-copy implementation was really stupidly coded). The point is that the super “optimised” algorithm isn’t better than two algorithms that are really easy to implement and verify. In fact, it’s an order of magnitude slower for large arrays.

## The Postmortem

The simple reason the “optimisations” didn’t make the code faster is that it’s bad at a higher level than I was optimising. Bit operations won’t make bogosort faster than quicksort. As I said in the beginning, my thinking was stuck in the 20th century. The “classic” approach of adding up total operation cost works for a machine that crunches operations one at a time, but most computers haven’t worked like that in decades. I’m not just talking about multi-core processors. Pipelines, memory caching, prefetching, out-of-order processing, microcode architectures, write buffering, SIMD, etc, all add a kind of parallelism to modern computers. But this parallelism needs smooth, predictable program flow to work efficiently.

Computer performance has pretty much done a 180 turn. After Big O, the big thing in performance was maximising entropy. An if statement that’s almost always true is almost redundant, so making everything as unpredictable as possible minimises redundancy and makes every operation count. That was the 20th century. Now we need to make operations as predictable as possible, to maximise the amount of work the CPU can do.

The two good algorithms process memory in very simple patterns (mostly straight sweeps). The bad algorithm jumps all over the place.

To get a more concrete picture of what the performance problems are, we need to use a profiler designed for modern optimisation. This will give us counts of “events” that disrupt the CPU’s processing of the code. The easiest tool for Linux-based systems is perf. This will dump a bunch of performance statistics for the rotate benchmark program:

There are more statistics available. To list them:

Then you can get the specific stats with something like this:

Cache misses and page faults are the most interesting stats here, which you might suspect from experience, or just guess by looking at all the stats. Because of its irregular memory access patterns, the minimal copy algorithm suffers from a lot more memory cache misses than the other two. The allocate-and-copy algorithm suffers more cache misses than the reversal algorithm, as well as more page faults. This is probably because it touches more “cold” memory that hasn’t been used recently — because of both allocating new memory and triggering garbage collections — and touches more memory overall.

Cache miss and page fault counts per rotation algorithm for arrays of 1Mi (1048576) integers. The results have been averaged over ten iterations.
Algorithm Cache Misses Page Faults
Reversal 3,340,210 1653
Allocate and Copy 5,061,042 18,046
Minimal Copy 25,839,824 1653

## Source Code

If you want to try it out for yourself (there are several ways to make the allocate-and-copy algorithm faster) here’s the full code listing. I compiled the code for the benchmarks with LDC, the LLVM-backed D compiler:

The default output is the average time in microseconds.