KLEE is a symbolic execution engine that can rigorously verify or find bugs in
software. It’s designed for C and C++, but it’s just an interpreter for LLVM bitcode combined with theorem prover
backends, so it can work with bitcode generated by ldc2. One
catch is that it needs a compatible bitcode port of the D runtime to run normal D code. I’m still interested in getting
KLEE to work with normal D code, but for now I’ve done some experiments with -betterC D.
How KLEE works
What makes KLEE special is its support for two kinds of variables: concrete and symbolic. Concrete variables are
just like the normal variables in normal code: they have a deterministic value at any given point in the program. On
the other hand, symbolic variables contain a bundle of logical constraints instead of values. Take this code:
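Something along these lines (a reconstruction from the description that follows; doA() and doB() are just placeholders for two different code paths):

```d
int x = klee_int("x");   // a fully symbolic 32b signed integer named "x"
klee_assume(x >= 0);     // only analyse non-negative values

if (x > 42)
{
    doA();
}
else
{
    doB();
    assert (3 * x != 21);
}
```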
klee_int("x") creates a symbolic integer that will be
called “x” in output reports. Initially it has no constraints and can take any value that a 32b signed integer can have. klee_assume(x >= 0) tells KLEE to add x >= 0 as a constraint, so now we’re only analysing the code for
non-negative 32b signed integers. On hitting the if, KLEE
checks if both branches are possible. Sure enough, x > 42
can be true or false even with the constraint x >= 0, so
KLEE has to fork. We now have two processes being interpreted on the VM: one executing doA() while x
holds the constraints x >= 0, x > 42, and another
executing doB() while x holds the constraints x >= 0, x <= 42. The second process will hit the assert statement, and KLEE will try to prove or disprove 3 * x != 21 using the assumptions x >= 0 and x <= 42. In this case it disproves it and reports a bug with x = 7 as a crashing example.
First steps
Here’s a toy example just to get things working. Suppose we have a function that makes an assumption for a
performance optimisation. Thankfully the assumption is made explicit with assert and is documented with a comment. Is the assumption valid?
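A minimal stand-in might look like this (the function name and body are hypothetical; the only thing that matters is the documented assumption):

```d
int doSomethingFast(int x)
{
    // The square root of 17 isn't an integer, so x * x can never be 17,
    // and we can skip handling that case.
    assert (x * x != 17);
    // ... fast path that relies on the assumption ...
    return x;
}
```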
Here’s a KLEE test rig. The KLEE function declarations and the main() entry point need to have extern(C) linkage, but anything else can be normal D code as long as it
compiles under -betterC:
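A sketch of the rig (klee_int() is declared as in KLEE's klee.h; doSomethingFast() is the hypothetical function above):

```d
// KLEE API declarations need C linkage.
extern(C) int klee_int(const(char)* name);

// With -betterC there's no D runtime entry point, so main() is extern(C) too.
extern(C) int main()
{
    // Give the function a fully symbolic 32b integer and let KLEE check the assert.
    int x = klee_int("x");
    doSomethingFast(x);
    return 0;
}
```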
It turns out there’s just one (frustrating) complication with running -betterC D under KLEE. In D, assert is handled specially by the compiler. By default, it throws an
Error, but for compatibility with KLEE, I’m using the
-checkaction=C flag. In C, assert is usually a macro that translates to code that calls some backend
implementation. That implementation isn’t standardised, so of course various C libraries work differently. ldc2 actually has built-in logic for implementing -checkaction=C correctly depending on the C library used.
KLEE uses a port of uClibc, which translates assert() to a four-parameter __assert() function that conflicts with the three-parameter __assert() used by other implementations. ldc2 uses LLVM's target Triple type to choose an assert() implementation, but the Triple doesn't recognise uClibc.
As a hacky workaround, I’m telling ldc2 to compile for Musl,
which “tricks” it into using an __assert_fail()
implementation that KLEE happens to support as well. I’ve opened an issue report.
Anyway, if we put all that code above into a file, we can compile it to KLEE-ready bitcode like this:
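With the code saved as, say, first.d (the filename and the exact musl triple here are my choices), it looks something like this:

```
$ ldc2 -g -O0 -betterC -checkaction=C -mtriple=x86_64-unknown-linux-musl -output-bc first.d
```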
-g is optional, but adds debug information that can be
useful for later analysis. The KLEE developers recommend disabling compiler optimisations and letting KLEE do its own
optimisations instead.
Now to run KLEE:
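Assuming the bitcode from the previous step is first.bc:

```
$ klee first.bc
```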
Straight away, KLEE has found two execution paths through the program: a happy path, and a path that fails the
assertion. Let’s see the results:
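KLEE writes its results to an output directory (and keeps a klee-last symlink pointing at the most recent run); the generated test cases can be dumped with ktest-tool. The exact test numbering depends on the run:

```
$ ls klee-last/
$ ktest-tool klee-last/test000001.ktest
$ ktest-tool klee-last/test000002.ktest
```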
One of the generated test cases is an input that triggers the happy path; the other is an input that causes the assertion error.
So, KLEE has deduced that when x is 869476073,
x * x does a 32b overflow to 17 and breaks the code.
It’s overkill for this simple example, but run.istats can
be opened with KCachegrind to view things like call graphs
and source code coverage. (Unfortunately, coverage stats can be misleading because correct code won’t ever hit the bounds-checking code inserted by the compiler.)
MurmurHash preimage
Here’s a slightly more useful example. D currently uses 32b MurmurHash3 as its standard non-cryptographic hash
function. What if we want to find strings that hash to a given special value? In general, we can solve problems like
this by asserting that something doesn’t exist (i.e., a string that hashes to a given value) and then challenging the
theorem prover to prove us wrong with a counterexample.
Unfortunately, we can’t just use hashOf() directly without
the runtime, but we can copy the
hash code from the runtime source into its own module, and then import it into a test rig like this:
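A sketch of such a rig. Here murmurhash3 is a hypothetical module holding the copied hash code, murmurHash3_32() is a hypothetical name for the function, and the string length and target value are arbitrary stand-ins:

```d
import murmurhash3 : murmurHash3_32;

// KLEE API declarations (klee_assume takes uintptr_t, i.e. ulong on a 64b target).
extern(C) void klee_make_symbolic(void* addr, size_t nbytes, const(char)* name);
extern(C) void klee_assume(ulong condition);

extern(C) int main()
{
    // A fixed-length symbolic string (8 characters is an arbitrary choice).
    char[8] s;
    klee_make_symbolic(s.ptr, s.sizeof, "s");

    // Keep it printable ASCII so the answer is easy to read.
    foreach (i; 0 .. s.length)
    {
        klee_assume(s[i] >= ' ');
        klee_assume(s[i] <= '~');
    }

    // Claim that no such string hashes to the target value (0xdeadbeef is a
    // stand-in) and challenge KLEE to find a counterexample.
    assert (murmurHash3_32(s[]) != 0xdeadbeef);
    return 0;
}
```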
Here’s how to compile and run it. Because we’re not checking correctness, we can use -boundscheck=off for a slight performance boost. It’s also worth enabling
KLEE’s optimiser.
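Assuming the rig is rig.d and the copied hash code is murmurhash3.d (both names are mine), one way to do it is to compile both modules to bitcode, link them with llvm-link, and hand the result to KLEE:

```
$ ldc2 -O0 -betterC -boundscheck=off -checkaction=C -mtriple=x86_64-unknown-linux-musl -output-bc rig.d murmurhash3.d
$ llvm-link -o hash.bc rig.bc murmurhash3.bc
$ time klee --optimize hash.bc
```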
It takes just over 4s, and the string it finds really does hash to the target value.
For comparison, here’s a simplistic brute force version in plain D:
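Something like this sketch (it reuses the same hypothetical murmurhash3 module and target value, and just counts through lowercase strings):

```d
import murmurhash3 : murmurHash3_32;
import std.stdio : writeln;

enum uint target = 0xdeadbeef;  // the same stand-in target value

void main()
{
    char[8] s = 'a';
    while (true)
    {
        if (murmurHash3_32(s[]) == target)
        {
            writeln(s);
            return;
        }
        // Treat the string as a base-26 counter over 'a'..'z'.
        size_t i = 0;
        while (i < s.length && s[i] == 'z')
        {
            s[i] = 'a';
            ++i;
        }
        if (i == s.length) return;  // exhausted all lowercase 8-char strings
        ++s[i];
    }
}
```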
This takes around 17s. The constraint-solver version is simpler to write, yet it still runs faster, because the solver can automatically do smarter things than recalculating string hashes from scratch on every iteration.
Binary search
Now for an example of testing and debugging. Here’s an implementation of binary search:
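A sketch of the implementation under test:

```d
bool bsearch(const(int)[] haystack, int needle)
{
    while (haystack.length > 0)
    {
        const mid_idx = haystack.length / 2;
        if (haystack[mid_idx] == needle) return true;
        if (haystack[mid_idx] < needle)
        {
            haystack = haystack[mid_idx .. $];
        }
        else
        {
            haystack = haystack[0 .. mid_idx];
        }
    }
    return false;
}
```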
Does it work? Here’s a test rig:
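A sketch of the rig: a symbolic haystack of up to 8 sorted elements (the same bound discussed later) plus a symbolic needle, using klee_range() from KLEE's API to pick the length:

```d
extern(C) void klee_make_symbolic(void* addr, size_t nbytes, const(char)* name);
extern(C) void klee_assume(ulong condition);
extern(C) int klee_range(int begin, int end, const(char)* name);

extern(C) int main()
{
    // Up to 8 symbolic elements; the actual length is symbolic too.
    enum max_len = 8;
    int[max_len] haystack_storage;
    klee_make_symbolic(haystack_storage.ptr, haystack_storage.sizeof, "haystack");
    const len = klee_range(0, max_len + 1, "len");
    const(int)[] haystack = haystack_storage[0 .. len];

    // Binary search only makes sense on sorted input.
    foreach (i; 1 .. haystack.length)
        klee_assume(haystack[i - 1] <= haystack[i]);

    int needle;
    klee_make_symbolic(&needle, needle.sizeof, "needle");

    bsearch(haystack, needle);
    return 0;
}
```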
When run in KLEE, it keeps running for a long, long time. How do we know it’s doing anything? By default KLEE writes
stats every 1s, so we can watch the live progress in another terminal:
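klee-stats reads those stats from the output directory, so something like this works:

```
$ watch klee-stats klee-last
```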
bsearch() should be pretty fast, so we should see KLEE
discovering new states rapidly. But instead it seems to be stuck. At least one fork of KLEE has heuristics for detecting infinite
loops, but plain KLEE doesn’t. There are timeout and batching options for making KLEE work better with code that
might have infinite loops, but let’s just take another look at the code. In particular, the loop condition:
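From the sketch above:

```d
while (haystack.length > 0)
```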
Binary search is supposed to reduce the search space by about half each iteration. haystack.length is an unsigned integer, so the loop must terminate as
long as it goes down every iteration. Let’s rewrite the code slightly so we can verify if that’s true:
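A sketch of the rewrite: give the next slice a name and assert that it really is smaller every time around the loop:

```d
bool bsearch(const(int)[] haystack, int needle)
{
    while (haystack.length > 0)
    {
        const mid_idx = haystack.length / 2;
        if (haystack[mid_idx] == needle) return true;
        const(int)[] next_haystack;
        if (haystack[mid_idx] < needle)
        {
            next_haystack = haystack[mid_idx .. $];
        }
        else
        {
            next_haystack = haystack[0 .. mid_idx];
        }
        // The loop can only terminate if the search space shrinks every iteration.
        assert (next_haystack.length < haystack.length);
        haystack = next_haystack;
    }
    return false;
}
```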
Now KLEE can find the bug!
Using the failing example as input and stepping through the code, it’s easy to find the problem: when the element at mid_idx is less than the needle, the new slice still starts at mid_idx, so a one-element haystack never gets any smaller.
Thinking about it, the if statement already excludes
haystack[mid_idx] from being needle, so there’s no reason to include it in next_haystack. Here’s the fix:
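In the sketch above, the offending slice just needs to skip the middle element:

```d
next_haystack = haystack[mid_idx + 1 .. $];
```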
But is the code correct now? Terminating isn’t enough; it needs to get the right answer, of course.
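One way to check that, continuing the sketch, is to compare bsearch() against an obviously correct linear scan at the end of main():

```d
// A linear scan is obviously correct, so bsearch() must agree with it on every path.
bool expected = false;
foreach (x; haystack)
{
    if (x == needle) expected = true;
}
assert (bsearch(haystack, needle) == expected);
```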
In just under 7s, KLEE has verified every possible execution path reachable with arrays of length from 0 to 8. Note,
that’s not just coverage of individual code lines, but coverage of full pathways through the code. KLEE hasn’t ruled
out stack corruption or integer overflows with large arrays, but I’m pretty confident the code is correct now.
KLEE has generated test cases that trigger each path, which we can keep and use as a faster-than-7s regression test
suite. Trouble is, the output from KLEE loses all type information and isn’t in a convenient format:
But we can write our own pretty-printing code and put it at the end of the test rig:
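Here's a sketch of what that can look like under -betterC. The buffer size and output format are arbitrary; haystack, needle and expected are the rig's variables, and the klee_get_value_i32() declaration belongs next to the other extern(C) declarations:

```d
import core.stdc.stdio : printf, snprintf;

extern(C) int klee_get_value_i32(int expr);

// At the end of main(): build the whole line in a buffer and print it with one
// call, so lines from different forked states don't get interleaved.
// (Assumes the buffer is big enough.)
char[1024] buf;
size_t pos = 0;
pos += snprintf(buf.ptr + pos, buf.length - pos, "assert(bsearch([");
foreach (i, x; haystack)
{
    if (i > 0) pos += snprintf(buf.ptr + pos, buf.length - pos, ", ");
    pos += snprintf(buf.ptr + pos, buf.length - pos, "%d", klee_get_value_i32(x));
}
pos += snprintf(buf.ptr + pos, buf.length - pos, "], %d) == %s);\n",
        klee_get_value_i32(needle), expected ? "true".ptr : "false".ptr);
printf("%s", buf.ptr);
```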
Ugh. In full D that would be just one format call using the %( %) array formatting specs. The output needs to be buffered up and printed all at once to stop output from different parallel executions getting mixed up. klee_get_value_i32() is needed
to get a concrete example from a symbolic variable (remember that a symbolic variable is just a bundle of
constraints).
Nice! An autogenerated regression test suite that’s better than anything I would write by hand. This is my favourite
use case for KLEE.
Change counting
One last example:
In Australia, coins come in 5c, 10c, 20c, 50c, $1 (100c) and $2 (200c) denominations. So you can make 70c using 14
5c coins, or using a 50c coin and a 20c coin. Obviously, fewer coins is usually more convenient. There’s a simple
greedy algorithm to make a small pile of coins that adds
up to a given value: just keep adding the biggest coin you can to the pile until you’ve reached the target value. It
turns out this trick is optimal — at least for Australian coins. Is it always optimal for any set of coin
denominations?
The hard thing about testing optimality is that you don’t know what the correct optimal values are without a
known-good algorithm. Without a constraints solver, I’d compare the output of the greedy algorithm with some obviously
correct brute force optimiser, run over all possible cases within some small-enough limit. But with KLEE, we can use a
different approach: comparing the greedy solution to a non-deterministic solution.
The greedy algorithm takes the list of coin denominations and the target value as input, so (like in the previous
examples) we make those symbolic. Then we make another symbolic array that represents an assignment of coin counts to
each coin denomination. We don’t specify anything about how to generate this assignment, but we constrain it to be a
valid assignment that adds up to the target value. It’s non-deterministic. Then we just assert that the
total number of coins in the non-deterministic assignment is at least the number of coins needed by the greedy
algorithm, which would be true if the greedy algorithm were universally optimal. Finally we ask KLEE to prove the
program correct or incorrect.
Here’s the code:
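Here's a sketch of that rig. The number of denominations, the value bounds, and the requirement that the smallest coin is 1 (so that every target is reachable) are simplifying choices of mine:

```d
extern(C) void klee_make_symbolic(void* addr, size_t nbytes, const(char)* name);
extern(C) void klee_assume(ulong condition);
extern(C) int klee_int(const(char)* name);

enum num_denominations = 3;  // keep the problem small so the solver finishes
enum max_value = 100;        // arbitrary bound on denominations and target

// The greedy algorithm: keep taking the biggest coin that still fits.
// Assumes denominations are sorted in descending order and end with 1.
int greedyCount(const(int)[] denominations, int target)
{
    int count = 0;
    foreach (coin; denominations)
    {
        count += target / coin;
        target %= coin;
    }
    return count;
}

extern(C) int main()
{
    // Symbolic denominations: strictly descending, with 1 as the smallest coin
    // so that every target value can be made.
    int[num_denominations] denominations;
    klee_make_symbolic(denominations.ptr, denominations.sizeof, "denominations");
    foreach (i; 0 .. num_denominations - 1)
        klee_assume(denominations[i] > denominations[i + 1]);
    klee_assume(denominations[num_denominations - 1] == 1);
    klee_assume(denominations[0] <= max_value);

    // Symbolic target value.
    int target = klee_int("target");
    klee_assume(target > 0);
    klee_assume(target <= max_value);

    // A non-deterministic coin assignment: nothing says how it's chosen, only
    // that it's valid (non-negative counts that add up to the target).
    int[num_denominations] counts;
    klee_make_symbolic(counts.ptr, counts.sizeof, "counts");
    int total_coins = 0;
    int total_value = 0;
    foreach (i; 0 .. num_denominations)
    {
        klee_assume(counts[i] >= 0);
        klee_assume(counts[i] <= max_value);
        total_coins += counts[i];
        total_value += counts[i] * denominations[i];
    }
    klee_assume(total_value == target);

    // If the greedy algorithm were always optimal, no valid assignment could beat it.
    assert (total_coins >= greedyCount(denominations[], target));
    return 0;
}
```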
And it found a counterexample after 14s: a set of denominations and a target value for which the non-deterministic assignment uses fewer coins than the greedy algorithm.
Note that this isn’t proven to be the new optimum; it’s just a witness that the greedy algorithm isn’t always
optimal. There’s a well-known dynamic programming solution that always works.
What’s next?
As I said, I’m interested in getting this to work with full D code. I’m also interested in using one of the floating point forks of KLEE on some D code, because floating point is much harder to test thoroughly than integer and string code.