KLEE is symbolic execution engine that can rigorously verify or find
bugs in software. It’s designed for C and C++, but it’s just an interpreter for LLVM bitcode combined with
theorem prover backends, so it can work with bitcode generated by ldc2. One catch is that it needs a
compatible bitcode port of the D runtime to run normal D code. I’m still interested in getting KLEE to work with
normal D code, but for now I’ve done some experiments with -betterC D.
How KLEE works
What makes KLEE special is its support for two kinds of variables: concrete and symbolic. Concrete variables
are just like the normal variables in normal code: they have a deterministic value at any given point in the
program. On the other hand, symbolic variables contain a bundle of logical constraints instead of values. Take
klee_int("x") creates a symbolic integer that will be called “x” in output reports.
Initially it has no contraints and can imply any value that a 32b signed integer can have. klee_assume(x
>= 0) tells KLEE to add x >= 0 as a constraint, so now we’re only analysing the code
for non-negative 32b signed integers. On hitting the if, KLEE checks if both branches are possible.
Sure enough, x > 42 can be true or false even with the constraint x >= 0, so KLEE
has to fork. We now have two processes being interpreted on the VM: one executing doA()
while x holds the constraints x >= 0, x > 42, and another executing
doB() while x holds the contraints x >= 0, x <= 42. The second
process will hit the assert statement, and KLEE will try to prove or disprove 3 * x !=
21 using the assumptions x >= 0, x <= 42 — in this case it will disprove it and report
a bug with x = 7 as a crashing example.
Here’s a toy example just to get things working. Suppose we have a function that makes an assumption for a
performance optimisation. Thankfully the assumption is made explicit with assert and is documented
with a comment. Is the assumption valid?
Here’s a KLEE test rig. The KLEE function declarations and the main() entry point need to have
extern(C) linkage, but anything else can be normal D code as long as it compiles under
It turns out there’s just one (frustrating) complication with running -betterC D under KLEE. In
D, assert is handled specially by the compiler. By default, it throws an Error, but for
compatibility with KLEE, I’m using the -checkaction=C flag. In C, assert is usually a
macro that translates to code that calls some backend implementation. That implementation isn’t standardised, so
of course various C libraries work differently. ldc2 actually has built-in logic for implementing
-checkaction=C correctly depending on the C library used.
KLEE uses a port of uClibc, which translates assert() to a
four-parameter __assert() function, which conflicts with the three-parameter __assert()
function in other implementations. ldc2 uses LLVM’s (target) Triple type for choosing
an assert() implementation configuration, but that doesn’t recognise uClibc. As a hacky workaround,
I’m telling ldc2 to compile for Musl, which “tricks” it into using an __assert_fail()
implementation that KLEE happens to support as well. I’ve opened an issue report.
Anyway, if we put all that code above into a file, we can compile it to KLEE-ready bitcode like this:
-g is optional, but adds debug information that can be useful for later analysis. The KLEE
developers recommend disabling compiler optimisations and letting KLEE do its own optimisations instead.
Now to run KLEE:
Straight away, KLEE has found two execution paths through the program: a happy path, and a path that fails the
assertion. Let’s see the results:
Here’s the example that triggers the happy path:
Here’s the example that causes an assertion error:
So, KLEE has deduced that when x is 869476073, x * x does a 32b overflow to 17 and
breaks the code.
It’s overkill for this simple example, but run.istats can be opened with KCachegrind to view things like call graphs and source code
coverage. (Unfortunately, coverage stats can be misleading because correct code won’t ever hit boundary check
code inserted by the compiler.)
Here’s a slightly more useful example. D currently uses 32b MurmurHash3 as its standard non-cryptographic hash
function. What if we want to find strings that hash to a given special value? In general, we can solve problems
like this by asserting that something doesn’t exist (i.e., a string that hashes to a given value) and then
challenging the theorem prover to prove us wrong with a counterexample.
Unfortunately, we can’t just use hashOf() directly without the runtime, but we can copy the
hash code from the runtime source into its own module, and then import it into a test rig like this:
Here’s how to compile and run it. Because we’re not checking correctness, we can use
-boundscheck=off for a slight performance boost. It’s also worth enabling KLEE’s optimiser.
It takes just over 4s:
And it actually works:
For comparison, here’s a simple brute force version in plain D:
This takes ~17s:
The constraint solver implementation is simpler to write, but is still faster because it can automatically do
smarter things than calculating hashes of strings from scratch every iteration.
Now for an example of testing and debugging. Here’s an implementation of binary search:
Does it work? Here’s a test rig:
When run in KLEE, it keeps running for a long, long time. How do we know it’s doing anything? By default KLEE
writes stats every 1s, so we can watch the live progress in another terminal:
bsearch() should be pretty fast, so we should see KLEE discovering new states rapidly. But
instead it seems to be stuck. At least one fork of
KLEE has heuristics for detecting infinite loops, but plain KLEE doesn’t. There are timeout and batching
options for making KLEE work better with code that might have infinite loops, but let’s just take another look at
the code. In particular, the loop condition:
Binary search is supposed to reduce the search space by about half each iteration.
haystack.length is an unsigned integer, so the loop must terminate as long as it goes down every
iteration. Let’s rewrite the code slightly so we can verify if that’s true:
Now KLEE can find the bug!
Using the failing example as input and stepping through the code, it’s easy to find the problem:
Thinking about it, the if statement already excludes haystack[mid_idx] from being
needle, so there’s no reason to include it in next_haystack. Here’s the fix:
But is the code correct now? Terminating isn’t enough; it needs to get the right answer, of course.
In just under 7s, KLEE has verified every possible execution path reachable with arrays of length from 0 to 8.
Note, that’s not just coverage of individual code lines, but coverage of full pathways through the code. KLEE
hasn’t ruled out stack corruption or integer overflows with large arrays, but I’m pretty confident the code is
KLEE has generated test cases that trigger each path, which we can keep and use as a faster-than-7s regression
test suite. Trouble is, the output from KLEE loses all type information and isn’t in a convenient format:
But we can write our own pretty-printing code and put it at the end of the test rig:
Ugh, that would be just one format call with D’s %( array formatting specs. The output needs to
be buffered up and printed all at once to stop output from different parallel executions getting mixed up.
klee_get_value_i32() is needed to get a concrete example from a symbolic variable (remember that a
symbolic variable is just a bundle of constraints).
Nice! An autogenerated regression test suite that’s better than anything I would write by hand. This is my
favourite use case for KLEE.
One last example:
In Australia, coins come in 5c, 10c, 20c, 50c, $1 (100c) and $2 (200c) denominations. So you can make 70c
using 14 5c coins, or using a 50c coin and a 20c coin. Obviously, fewer coins is usually more convenient. There’s
a simple greedy algorithm to make a small pile of
coins that adds up to a given value: just keep adding the biggest coin you can to the pile until you’ve reached
the target value. It turns out this trick is optimal — at least for Australian coins. Is it always optimal for
any set of coin denominations?
The hard thing about testing optimality is that you don’t know what the correct optimal values are without a
known-good algorithm. Without a constraints solver, I’d compare the output of the greedy algorithm with some
obviously correct brute force optimiser, run over all possible cases within some small-enough limit. But with
KLEE, we can use a different approach: comparing the greedy solution to a non-deterministic solution.
The greedy algorithm takes the list of coin denominations and the target value as input, so (like in the
previous examples) we make those symbolic. Then we make another symbolic array that represents an assignment of
coin counts to each coin denomination. We don’t specify anything about how to generate this assignment, but we
constrain it to be a valid assignment that adds up to the target value. It’s non-deterministic. Then we just assert that the total
number of coins in the non-deterministic assignment is at least the number of coins needed by the greedy
algorithm, which would be true if the greedy algorithm were universally optimal. Finally we ask KLEE to prove the
program correct or incorrect.
Here’s the code:
And here’s the counterexample it found after 14s:
Note that this isn’t proven to be the new optimum; it’s just a witness that the greedy algorithm isn’t always
optimal. There’s a well-known dynamic
solution that always works.
As I said, I’m interesting in getting this to work with full D code. I’m also interested in using one of the floating point forks of KLEE on some D because
floating point is much harder to test thoroughly than integer and string code.