Recently I’ve been playing around with using bpftrace to trace and profile D’s garbage collector. Here are some
examples of the cool stuff that’s possible.
What is bpftrace?
It’s a high-level debugging tool based on Linux’s eBPF. “eBPF” stands for “extended Berkely packet filter”, but
that’s just a historical name and doesn’t mean much today. It’s really a virtual machine (like the JVM) that sits inside the Linux kernel and runs code in a
special eBPF instruction set similar to normal machine code. Users are expected to write short programs in high-level
languages (including C and others) that get compiled to eBPF and loaded into the kernel on the fly to do interesting
things.
As you might guess, eBPF is powerful for instrumenting a running kernel, but it also supports instrumenting
user-space programs.
What you need
First you need a Linux kernel. Sorry BSD, Mac OS and Windows users. (But some of you can use DTrace.)
Also, not just any Linux kernel will work. This stuff is relatively new, so you’ll need a modern kernel with
BPF-related features enabled. You might need to use the newest (or even testing) version of a distro. Here’s how to
check if your kernel meets the requirements:
Here’s a quick test you can do to make sure you’ve got everything working. First, let’s make a Hello World D
binary:
Now let’s bpftrace it. bpftrace uses a high-level language that’s obviously inspired by AWK.
I’ll explain enough to understand the post, but you can also check out the bpftrace reference guide and one-liner tutorial. The minimum you
need to know is that a bpftrace program is a list of event:name
/filter predicate/ { program(); code(); } blocks that define code snippets to be run on events.
This time I’m only using Linux uprobes, which trigger on functions in user-space programs. The syntax is
uprobe:/path/to/binary:functionName. One gotcha is that D
“mangles” (encodes) function names before inserting them
into the binary. If we want to trigger on the D code’s main()
function, we need to use the mangled name: _Dmain. (By the
way, nm program | grep ' _D.*functionName' is one quick trick
for finding mangled names.)
Run this bpftrace invocation in a terminal as root
user:
While this is running, it’ll print a message every time the D Hello World program is executed by any user in any
terminal. Press Ctrl+C to quit.
All bpftrace code can be run directly from the command
line like in the example above. But to make things easier to read from now on, I’ll make neatly formatted scripts.
Tracing some real code
I’m using D-Scanner, the D code analyser, as an example
of a simple but non-trivial D workload. One nice thing about bpftrace and uprobes is that no modification of the program is needed.
I’m just using a normal build of the dscanner tool, and using
the D runtime source code as a codebase to analyse.
There’s one more gotcha when using bpftrace on
dscanner to trace GC functions: the binary file we specify
for the uprobe needs to be the binary file that actually contains the GC functions. That could be the D binary itself,
or it could be a shared D runtime library. Try running ldd
/path/to/d_program to list any linked shared libraries, and if the output contains druntime, use that full path when specifying uprobes. My dscanner binary doesn’t link to a shared D runtime, so I just use the
full path to dscanner. (Running which dscanner gives /usr/local/bin/dscanner for me.)
Anyway, all the GC functions live in a gc module, so their
mangled names start with _D2gc. Here’s a bpftrace invocation that tallies GC function calls. For convenience, it
also includes a uretprobe to automatically exit when main()
returns. The output is sorted to make it a little easier to read.
All these functions are in src/gc/, and most of the interesting ones here are in src/gc/impl/conservative/. There are 85 calls to _D2gc4impl12conservativeQw3Gcx11fullcollectMFNbbZm, which ddemangle translates to
nothrow ulong gc.impl.conservative.gc.Gcx.fullcollect(bool).
That matches up with the report from --DRT-gcopt=profile:1.
The heart of the bpftrace program is @[probe] = count();. @ prefixes a global variable, in this case a variable with an empty name
(allowed by bpftrace). We’re using the variable as a map
(like an associative array in D), and indexing it with probe,
a built-in variable containing the name of the uprobe that was triggered. The tally is kept using the magic
count() function.
Garbage collection timings
How about something more interesting, like generating a profile of collection timings? This time, to get more data,
I won’t make bpftrace exit as soon as the dscanner exits. I’ll keep it running and run dscanner 100 times before quitting bpftrace with Ctrl+C:
Et voila! A log-scale histogram of the nsecs timestamp
difference between entering and exiting fullcollect(). The
times are in nanoseconds, so we see that most collections are taking less than half a millisecond, but we have tail
cases that take 1-2ms.
Function arguments
bpftrace provides arg0, arg1,
arg2, etc. built-in variables for accessing the arguments to
a traced function. There are a couple of complications with using them with D code, however.
The first is that (at the binary level) dmd makes
extern(D) functions (i.e., normal D functions) take arguments
in the reverse order of extern(C) functions (that
bpftrace is expecting). Suppose you have a simple
three-argument function. If it’s using the C calling convention, bpftrace will recognise the first argument as arg0. If it’s using the D calling convention, however, it’ll be picked up
as arg2.
If you look at the D ABI spec, you’ll notice that (just
like in C++) there can be a couple of hidden arguments if the function is more complex. If dfunc above returned a large struct, there can be an extra hidden
argument for implementing copy elision, which means the first
argument would actually be arg3, and arg0 would be the hidden argument. If dfunc were also a member function, it would have a hidden this argument, which would bump up the first argument to arg4.
To get the hang of this, you might need to experiment with tracing function calls with known arguments.
Allocation sizes
Let’s get a histogram of the memory allocation request sizes. Looking at the list of GC functions traced earlier,
and comparing it with the GC source code, it looks like we need to trace these functions and grab the size argument:
As class member functions, they have a hidden this
argument as well. The last one, qalloc(), returns a struct,
so it also has a hidden argument for copy elision. So size is
arg3 for the first two functions, and arg4 for qalloc(). Time to run a trace:
So, we have a lot of small allocations, with a very long tail of larger allocations. Remember, size is on a log
scale, so that long tail represents a very skewed distribution.
Small allocation hotspots
Now for something more complex. Suppose we’re profiling our code and looking for low-hanging fruit for reducing the
number of memory allocations. Code that makes a lot of small allocations tends to be a good candidate for this kind of
refactoring. bpftrace lets us grab stack traces, which can be
used to see what part of the main program caused an allocation.
As of writing, there’s one little complication because of a limitation of bpftrace’s stack trace handling: it can only show meaningful function
symbol names (as opposed to raw memory addresses) if bpftrace
quits while the target program is still running. There’s an
open bug report for improving this behaviour, but in the meantime I just made sure dscanner took a long time, and that I shut down bpftrace first.
Here’s how to grab the top three stack traces that lead to small (<16B) memory allocations with qalloc():
It looks like a lot of the small allocations are due to a red-black tree in ModuleCache.
What’s next?
I think these examples already show that bpftrace is a
pretty powerful tool. There’s a lot more that can done, and I highly recommended reading Brendan Gregg’s eBPF tutorials.
I used uprobes to trace arbitrary functions in the D runtime. The pro of this is the freedom to do anything, but the
cons are that I had to refer to the D runtime source code and manually deal with the D ABI. There’s also no guarantee
that a script I write today will work with future versions of the runtime. Linux also supports making well-defined
tracepoints in user code using a feature called USDT. That should let D
code export stable tracepoints that can be used without worrying about the D ABI. I might do more experiments in
future.