Recently I’ve been playing around with using bpftrace to trace and profile D’s garbage collector. Here are some examples of the cool stuff that’s possible.

## What is bpftrace?

It’s a high-level debugging tool based on Linux’s eBPF. “eBPF” stands for “extended Berkely packet filter”, but that’s just a historical name and doesn’t mean much today. It’s really a virtual machine (like the JVM) that sits inside the Linux kernel and runs code in a special eBPF instruction set similar to normal machine code. Users are expected to write short programs in high-level languages (including C and others) that get compiled to eBPF and loaded into the kernel on the fly to do interesting things.

As you might guess, eBPF is powerful for instrumenting a running kernel, but it also supports instrumenting user-space programs.

## What you need

First you need a Linux kernel. Sorry BSD, Mac OS and Windows users. (But some of you can use DTrace.)

Also, not just any Linux kernel will work. This stuff is relatively new, so you’ll need a modern kernel with BPF-related features enabled. You might need to use the newest (or even testing) version of a distro. Here’s how to check if your kernel meets the requirements:

## bpftrace D “Hello World”

Here’s a quick test you can do to make sure you’ve got everything working. First, let’s make a Hello World D binary:

Now let’s bpftrace it. bpftrace uses a high-level language that’s obviously inspired by AWK. I’ll explain enough to understand the post, but you can also check out the bpftrace reference guide and one-liner tutorial. The minimum you need to know is that a bpftrace program is a list of event:name /filter predicate/ { program(); code(); } blocks that define code snippets to be run on events.

This time I’m only using Linux uprobes, which trigger on functions in user-space programs. The syntax is uprobe:/path/to/binary:functionName. One gotcha is that D “mangles” (encodes) function names before inserting them into the binary. If we want to trigger on the D code’s main() function, we need to use the mangled name: _Dmain. (By the way, nm program | grep ' _D.*functionName' is one quick trick for finding mangled names.)

Run this bpftrace invocation in a terminal as root user:

While this is running, it’ll print a message every time the D Hello World program is executed by any user in any terminal. Press Ctrl+C to quit.

All bpftrace code can be run directly from the command line like in the example above. But to make things easier to read from now on, I’ll make neatly formatted scripts.

## Tracing some real code

I’m using D-Scanner, the D code analyser, as an example of a simple but non-trivial D workload. One nice thing about bpftrace and uprobes is that no modification of the program is needed. I’m just using a normal build of the dscanner tool, and using the D runtime source code as a codebase to analyse.

Before using bpftrace, let’s try using the profiling that’s built into the D GC implementation itself:

(If you can make a custom build, you can also use the D runtime GC API to get stats.)

There’s one more gotcha when using bpftrace on dscanner to trace GC functions: the binary file we specify for the uprobe needs to be the binary file that actually contains the GC functions. That could be the D binary itself, or it could be a shared D runtime library. Try running ldd /path/to/d_program to list any linked shared libraries, and if the output contains druntime, use that full path when specifying uprobes. My dscanner binary doesn’t link to a shared D runtime, so I just use the full path to dscanner. (Running which dscanner gives /usr/local/bin/dscanner for me.)

Anyway, all the GC functions live in a gc module, so their mangled names start with _D2gc. Here’s a bpftrace invocation that tallies GC function calls. For convenience, it also includes a uretprobe to automatically exit when main() returns. The output is sorted to make it a little easier to read.

All these functions are in src/gc/, and most of the interesting ones here are in src/gc/impl/conservative/. There are 85 calls to _D2gc4impl12conservativeQw3Gcx11fullcollectMFNbbZm, which ddemangle translates to nothrow ulong gc.impl.conservative.gc.Gcx.fullcollect(bool). That matches up with the report from --DRT-gcopt=profile:1.

The heart of the bpftrace program is @[probe] = count();. @ prefixes a global variable, in this case a variable with an empty name (allowed by bpftrace). We’re using the variable as a map (like an associative array in D), and indexing it with probe, a built-in variable containing the name of the uprobe that was triggered. The tally is kept using the magic count() function.

## Garbage collection timings

How about something more interesting, like generating a profile of collection timings? This time, to get more data, I won’t make bpftrace exit as soon as the dscanner exits. I’ll keep it running and run dscanner 100 times before quitting bpftrace with Ctrl+C:

Et voila! A log-scale histogram of the nsecs timestamp difference between entering and exiting fullcollect(). The times are in nanoseconds, so we see that most collections are taking less than half a millisecond, but we have tail cases that take 1-2ms.

## Function arguments

bpftrace provides arg0, arg1, arg2, etc. built-in variables for accessing the arguments to a traced function. There are a couple of complications with using them with D code, however.

The first is that (at the binary level) dmd makes extern(D) functions (i.e., normal D functions) take arguments in the reverse order of extern(C) functions (that bpftrace is expecting). Suppose you have a simple three-argument function. If it’s using the C calling convention, bpftrace will recognise the first argument as arg0. If it’s using the D calling convention, however, it’ll be picked up as arg2.

If you look at the D ABI spec, you’ll notice that (just like in C++) there can be a couple of hidden arguments if the function is more complex. If dfunc above returned a large struct, there can be an extra hidden argument for implementing copy elision, which means the first argument would actually be arg3, and arg0 would be the hidden argument. If dfunc were also a member function, it would have a hidden this argument, which would bump up the first argument to arg4.

To get the hang of this, you might need to experiment with tracing function calls with known arguments.

## Allocation sizes

Let’s get a histogram of the memory allocation request sizes. Looking at the list of GC functions traced earlier, and comparing it with the GC source code, it looks like we need to trace these functions and grab the size argument:

As class member functions, they have a hidden this argument as well. The last one, qalloc(), returns a struct, so it also has a hidden argument for copy elision. So size is arg3 for the first two functions, and arg4 for qalloc(). Time to run a trace:

So, we have a lot of small allocations, with a very long tail of larger allocations. Remember, size is on a log scale, so that long tail represents a very skewed distribution.

## Small allocation hotspots

Now for something more complex. Suppose we’re profiling our code and looking for low-hanging fruit for reducing the number of memory allocations. Code that makes a lot of small allocations tends to be a good candidate for this kind of refactoring. bpftrace lets us grab stack traces, which can be used to see what part of the main program caused an allocation.

As of writing, there’s one little complication because of a limitation of bpftrace’s stack trace handling: it can only show meaningful function symbol names (as opposed to raw memory addresses) if bpftrace quits while the target program is still running. There’s an open bug report for improving this behaviour, but in the meantime I just made sure dscanner took a long time, and that I shut down bpftrace first.

Here’s how to grab the top three stack traces that lead to small (<16B) memory allocations with qalloc():

It looks like a lot of the small allocations are due to a red-black tree in ModuleCache.

## What’s next?

I think these examples already show that bpftrace is a pretty powerful tool. There’s a lot more that can done, and I highly recommended reading Brendan Gregg’s eBPF tutorials.

I used uprobes to trace arbitrary functions in the D runtime. The pro of this is the freedom to do anything, but the cons are that I had to refer to the D runtime source code and manually deal with the D ABI. There’s also no guarantee that a script I write today will work with future versions of the runtime. Linux also supports making well-defined tracepoints in user code using a feature called USDT. That should let D code export stable tracepoints that can be used without worrying about the D ABI. I might do more experiments in future.