KLEE is a symbolic execution engine that can rigorously verify or find bugs in software. It’s designed for C and C++, but it’s just an interpreter for LLVM bitcode combined with theorem prover backends, so it can work with bitcode generated by ldc2. One catch is that it needs a compatible bitcode port of the D runtime to run normal D code. I’m still interested in getting KLEE to work with normal D code, but for now I’ve done some experiments with -betterC D.

## How KLEE works

What makes KLEE special is its support for two kinds of variables: concrete and symbolic. Concrete variables are just like the normal variables in normal code: they have a deterministic value at any given point in the program. On the other hand, symbolic variables contain a bundle of logical constraints instead of values. Take this code:

klee_int("x") creates a symbolic integer that will be called “x” in output reports. Initially it has no contraints and can imply any value that a 32b signed integer can have. klee_assume(x >= 0) tells KLEE to add x >= 0 as a constraint, so now we’re only analysing the code for non-negative 32b signed integers. On hitting the if, KLEE checks if both branches are possible. Sure enough, x > 42 can be true or false even with the constraint x >= 0, so KLEE has to fork. We now have two processes being interpreted on the VM: one executing doA() while x holds the constraints x >= 0, x > 42, and another executing doB() while x holds the contraints x >= 0, x <= 42. The second process will hit the assert statement, and KLEE will try to prove or disprove 3 * x != 21 using the assumptions x >= 0, x <= 42 — in this case it will disprove it and report a bug with x = 7 as a crashing example.

## First steps

Here’s a toy example just to get things working. Suppose we have a function that makes an assumption for a performance optimisation. Thankfully the assumption is made explicit with assert and is documented with a comment. Is the assumption valid?

Here’s a KLEE test rig. The KLEE function declarations and the main() entry point need to have extern(C) linkage, but anything else can be normal D code as long as it compiles under -betterC:

It turns out there’s just one (frustrating) complication with running -betterC D under KLEE. In D, assert is handled specially by the compiler. By default, it throws an Error, but for compatibility with KLEE, I’m using the -checkaction=C flag. In C, assert is usually a macro that translates to code that calls some backend implementation. That implementation isn’t standardised, so of course various C libraries work differently. ldc2 actually has built-in logic for implementing -checkaction=C correctly depending on the C library used.

KLEE uses a port of uClibc, which translates assert() to a four-parameter __assert() function, which conflicts with the three-parameter __assert() function in other implementations. ldc2 uses LLVM’s (target) Triple type for choosing an assert() implementation configuration, but that doesn’t recognise uClibc. As a hacky workaround, I’m telling ldc2 to compile for Musl, which “tricks” it into using an __assert_fail() implementation that KLEE happens to support as well. I’ve opened an issue report.

Anyway, if we put all that code above into a file, we can compile it to KLEE-ready bitcode like this:

-g is optional, but adds debug information that can be useful for later analysis. The KLEE developers recommend disabling compiler optimisations and letting KLEE do its own optimisations instead.

Now to run KLEE:

Straight away, KLEE has found two execution paths through the program: a happy path, and a path that fails the assertion. Let’s see the results:

Here’s the example that triggers the happy path:

Here’s the example that causes an assertion error:

So, KLEE has deduced that when x is 869476073, x * x does a 32b overflow to 17 and breaks the code.

It’s overkill for this simple example, but run.istats can be opened with KCachegrind to view things like call graphs and source code coverage. (Unfortunately, coverage stats can be misleading because correct code won’t ever hit boundary check code inserted by the compiler.)

## MurmurHash preimage

Here’s a slightly more useful example. D currently uses 32b MurmurHash3 as its standard non-cryptographic hash function. What if we want to find strings that hash to a given special value? In general, we can solve problems like this by asserting that something doesn’t exist (i.e., a string that hashes to a given value) and then challenging the theorem prover to prove us wrong with a counterexample.

Unfortunately, we can’t just use hashOf() directly without the runtime, but we can copy the hash code from the runtime source into its own module, and then import it into a test rig like this:

Here’s how to compile and run it. Because we’re not checking correctness, we can use -boundscheck=off for a slight performance boost. It’s also worth enabling KLEE’s optimiser.

It takes just over 4s:

And it actually works:

For comparison, here’s a simplistic brute force version in plain D:

This takes ~17s:

The constraint solver implementation is simpler to write, but is still faster because it can automatically do smarter things than calculating hashes of strings from scratch every iteration.

Now for an example of testing and debugging. Here’s an implementation of binary search:

Does it work? Here’s a test rig:

When run in KLEE, it keeps running for a long, long time. How do we know it’s doing anything? By default KLEE writes stats every 1s, so we can watch the live progress in another terminal:

bsearch() should be pretty fast, so we should see KLEE discovering new states rapidly. But instead it seems to be stuck. At least one fork of KLEE has heuristics for detecting infinite loops, but plain KLEE doesn’t. There are timeout and batching options for making KLEE work better with code that might have infinite loops, but let’s just take another look at the code. In particular, the loop condition:

Binary search is supposed to reduce the search space by about half each iteration. haystack.length is an unsigned integer, so the loop must terminate as long as it goes down every iteration. Let’s rewrite the code slightly so we can verify if that’s true:

Now KLEE can find the bug!

Using the failing example as input and stepping through the code, it’s easy to find the problem:

Thinking about it, the if statement already excludes haystack[mid_idx] from being needle, so there’s no reason to include it in next_haystack. Here’s the fix:

But is the code correct now? Terminating isn’t enough; it needs to get the right answer, of course.

In just under 7s, KLEE has verified every possible execution path reachable with arrays of length from 0 to 8. Note, that’s not just coverage of individual code lines, but coverage of full pathways through the code. KLEE hasn’t ruled out stack corruption or integer overflows with large arrays, but I’m pretty confident the code is correct now.

KLEE has generated test cases that trigger each path, which we can keep and use as a faster-than-7s regression test suite. Trouble is, the output from KLEE loses all type information and isn’t in a convenient format:

But we can write our own pretty-printing code and put it at the end of the test rig:

Ugh, that would be just one format call with D’s %( array formatting specs. The output needs to be buffered up and printed all at once to stop output from different parallel executions getting mixed up. klee_get_value_i32() is needed to get a concrete example from a symbolic variable (remember that a symbolic variable is just a bundle of constraints).

Nice! An autogenerated regression test suite that’s better than anything I would write by hand. This is my favourite use case for KLEE.

## Change counting

One last example:

In Australia, coins come in 5c, 10c, 20c, 50c, $1 (100c) and$2 (200c) denominations. So you can make 70c using 14 5c coins, or using a 50c coin and a 20c coin. Obviously, fewer coins is usually more convenient. There’s a simple greedy algorithm to make a small pile of coins that adds up to a given value: just keep adding the biggest coin you can to the pile until you’ve reached the target value. It turns out this trick is optimal — at least for Australian coins. Is it always optimal for any set of coin denominations?

The hard thing about testing optimality is that you don’t know what the correct optimal values are without a known-good algorithm. Without a constraints solver, I’d compare the output of the greedy algorithm with some obviously correct brute force optimiser, run over all possible cases within some small-enough limit. But with KLEE, we can use a different approach: comparing the greedy solution to a non-deterministic solution.

The greedy algorithm takes the list of coin denominations and the target value as input, so (like in the previous examples) we make those symbolic. Then we make another symbolic array that represents an assignment of coin counts to each coin denomination. We don’t specify anything about how to generate this assignment, but we constrain it to be a valid assignment that adds up to the target value. It’s non-deterministic. Then we just assert that the total number of coins in the non-deterministic assignment is at least the number of coins needed by the greedy algorithm, which would be true if the greedy algorithm were universally optimal. Finally we ask KLEE to prove the program correct or incorrect.

Here’s the code:

And here’s the counterexample it found after 14s:

Note that this isn’t proven to be the new optimum; it’s just a witness that the greedy algorithm isn’t always optimal. There’s a well-known dynamic programming solution that always works.

## What’s next?

As I said, I’m interesting in getting this to work with full D code. I’m also interested in using one of the floating point forks of KLEE on some D because floating point is much harder to test thoroughly than integer and string code.