Why Textbook Statistical Methods aren't as Effective in IT

Published 01 December 2021

If you work with tech, there’s a good chance you’ve come across some of the following statistical tools:

Averages
Standard deviations
t-tests
Least-squares line of best fit

These are the most common tools in a kit that’s typically taught in undergraduate statistics classes and widely used in the outside world. However, this toolkit just isn’t that effective in most IT applications (such as analysing performance benchmarks). Fortunately, there are other tools that do work well. They’re normally taught in “advanced” statistics classes, but I think some of them should become the standard toolkit for tech work (and possibly elsewhere).

In this post I want to talk a bit about why the usual toolkit doesn’t work well. First, let me give an example.

Reverse Engineering a Docker Image

Published 18 March 2021

This started with a consulting snafu: Government organisation A got government organisation B to develop a web application. Government organisation B subcontracted part of the work to somebody. Hosting and maintenance of the project was later contracted out to a private-sector company C. Company C discovered that the subcontracted somebody (who was long gone) had built a custom Docker image and made it a dependency of the build system, but without committing the original Dockerfile. That left company C with a contractual obligation to manage a Docker image they had no source code for. Company C calls me in once in a while to do various things, so doing something about this mystery meat Docker image became my job.

Fortunately, the Docker image format is a lot more transparent than it could be. A little detective work is needed, but a lot can be figured out just by pulling apart an image file. As an example, here’s a quick walkthrough of an image for the Prettier code formatter. (In fact, it’s so easy, there’s a tool for it. Thanks Ezequiel Gonzalez Rial.)

D Declarations for C and C++ Programmers

Published 18 August 2020

Tags: dlang , C , C++ and Programming Languages

Because D was originally created by a C++ compiler writer, Walter Bright, it’s an easy language for C and C++ programmers to learn, but there are little differences in the way declarations work. I learned them piecemeal in different places, but I’m going to dump a bunch in this one post.

Some Useful Probability Facts for Systems Programming

Published 27 January 2020

Tags: Systems Design , Mathematics and Computer Science

Probability problems come up a lot in systems programming, and I’m using that term loosely to mean everything from operating systems programming and networking, to building large online services, to creating virtual worlds like in games. Here’s a bunch of rough-and-ready probability rules of thumb that are deeply related and have many practical applications when designing systems.

Debugging Software Deployments with strace

Published 14 November 2019

Tags: Ops , Tools , Reliability , Servers and Low Level

Translations:русский

Most of my paid work involves deploying software systems, which means I spend a lot of time trying to answer the following questions:

This software works on the original developer’s machine, so why doesn’t it work on mine?
This software worked on my machine yesterday, so why doesn’t it work today?

That’s a kind of debugging, but it’s a different kind of debugging from normal software debugging. Normal debugging is usually about the logic of the code, but deployment debugging is usually about the interaction between the code and its environment. Even when the root cause is a logic bug, the fact that the software apparently worked on another machine means that the environment is usually involved somehow.

So, instead of using normal debugging tools like gdb, I have another toolset for debugging deployments. My favourite tool for “Why isn’t this software working on this machine?” is strace.

Object-Oriented Programming and Essential State

Published 13 October 2019

Tags: Software Engineering and Systems Design

Translations:中文

Back in 2015, Brian Will wrote a provocative blog post: Object-Oriented Programming: A Disaster Story. He followed it up with a video called Object-Oriented Programming is Bad, which is much more detailed. I recommend taking the time to watch the video, but here’s my one-paragraph summary:

The Platonic ideal of OOP is a sea of decoupled objects that send stateless messages to one another. No one really makes software like that, and Brian points out that it doesn’t even make sense: objects need to know which other objects to send messages to, and that means they need to hold references to one another. Most of the video is about the pain that happens trying to couple objects for control flow, while pretending that they’re decoupled by design.

Overall his ideas resonate with my own experiences of OOP: objects can be okay, but I’ve just never been satisfied with object-orientation for modelling a program’s control flow, and trying to make code “properly” object-oriented always seems to create layers of unneccessary complexity.

There’s one thing I don’t think he explains fully. He says outright that “encapsulation does not work”, but follows it with the footnote “at fine-grained levels of code”, and goes on to acknowledge that objects can sometimes work, and that encapsulation can be okay at the level of, say, a library or file. But he doesn’t explain exactly why it sometimes works and sometimes doesn’t, and how/where to draw the line. Some people might say that makes his “OOP is bad” claim flawed, but I think his point stands, and that the line can be drawn between essential state and accidental state.

Why const Doesn't Make C Code Faster

Published 12 August 2019, Updated 24 August 2019

Tags: Performance , C , C++ , Low Level and Programming Languages

Translations:中文, русский

In a post a few months back I said it’s a popular myth that const is helpful for enabling compiler optimisations in C and C++. I figured I should explain that one, especially because I used to believe it was obviously true, myself. I’ll start off with some theory and artificial examples, then I’ll do some experiments and benchmarks on a real codebase: Sqlite.

Profiling D's Garbage Collection with Bpftrace

Published 26 April 2019

Tags: dlang , Performance , Tools and Low Level

Recently I’ve been playing around with using bpftrace to trace and profile D’s garbage collector. Here are some examples of the cool stuff that’s possible.

D as a C Replacement

Published 05 April 2019, Updated 07 July 2019

Tags: dlang and Programming Languages

Sircmpwn (the main developer behind the Sway Wayland compositor) recently wrote a blog post about how he thinks Rust is not a good C replacement. I don’t know if he’d like the D programming language either, but it’s become a C replacement for me.

Hello World Marketing (or, How I Find Good, Boring Software)

Published 19 March 2019

Tags: Sociotechnology , Tools , Servers , Software Engineering and Reliability

Back in 2001 Joel Spolsky wrote his classic essay “Good Software Takes Ten Years. Get Used To it”. Nothing much has changed since then: software is still taking around a decade of development to get good, and the industry is still getting used to that fact. Unfortunately, the industry has investors who want to see hockey stick growth rates on software that’s a year old or less. The result is an antipattern I like to call “Hello World Marketing”. Once you start to notice it, you see it everywhere, and it’s a huge red flag when choosing software tools.