How Real-World Apps Lose Data

Published 06 June 2021

A great thing about modern app development is that there are cloud providers to worry about things like hardware failures or how to set up RAID. Decent cloud providers are extremely unlikely to lose your app’s data, so sometimes I get asked what backups are really for these days. Here are some real-world stories that show exactly what.

Robust and Race-free Server Logging using Named Pipes

Published 10 October 2020

Tags: Ops , Reliability , Servers , Tools and Systems Design

If you do any server administration work, you’ll have worked with log files. And if your servers need to be reliable, you’ll know that log files are common source of problems, especially when you need to rotate or ship them (which is practically always). In particular, moving files around causes race conditions.

Thankfully, there are better ways. With named pipes, you can have a simple and robust logging stack, with no race conditions, and without patching your servers to support some network logging protocol.

PSA: Please Just Close Your Files

Published 06 July 2020

Tags: Reliability and Software Engineering

An article called “Contrarian view on closing files” is on the front page of Hacker News right now with over 60 upvotes. It starts off with a quote from the Google Python style guide:

Explicitly close files and sockets when done with them. Leaving files, sockets or other file-like objects open unnecessarily has many downsides […]

The article’s main complaint is that “this advice is applying a notably higher standard of premature optimization to file descriptors than to any other kind of resource”. It complains that it’s “depressingly commonplace” to see code like this:

with open("README.md") as readme:
    long_description = readme.read()

Sure, if it’s a one-off read of a doc file, you can almost certainly get away with just open("README.md").read(), but I honestly have no idea what’s depressing about code that just works reliably.

Leaving files and sockets open is something that you can usually get away with, until weird stuff happens. And, no, it’s not just about running out of file descriptors (although that does happen, too). I’ve previously written about servers that mysteriously ran out of disk space because (spoiler) they were deleting files that hadn’t been closed properly. Regardless of how awesome your computer and network equipment are, the number of TCP connections you can make to a given host and port are limited by a 16 bit integer (and practically always limited more by your network settings), then you get network failures. Writing to files that you don’t explicitly close (or flush) is especially dicey — the write might actually happen straight away, then on another day or another environment it might get delayed.

Sure, these failure modes aren’t very common, but they definitely happen. I can think of three examples right now from the past few years of my day job. Closing files as a habit is easier than figuring out when it’ll go wrong.

The article has an example of a function that lazily loads JSON given a file path string. Sure, it doesn’t work properly if you close the file inside the function, but I’d say the problem is in the API design: the file handle resource straddles the interface.

The first good alternative is to keep the file handle completely inside: take a path argument, load the data eagerly, close the file and return the parsed data. The other good alternative is to keep the file handle completely outside: take a file handle as argument and return it wrapped in a lazy JSON parser. Either way makes it easier to see how and where the file should be closed.

The Google advice is pretty solid: when you’re done with a file or socket, just close it. I’ll add: sure, maybe it won’t always be obvious when you’re done with a handle, but perhaps that code design is making life more “exciting” than necessary. Production failures are more depressing than file closing code ever will be.

Debugging Software Deployments with strace

Published 14 November 2019

Tags: Ops , Tools , Reliability , Servers and Low Level

Translations:русский

Most of my paid work involves deploying software systems, which means I spend a lot of time trying to answer the following questions:

This software works on the original developer’s machine, so why doesn’t it work on mine?
This software worked on my machine yesterday, so why doesn’t it work today?

That’s a kind of debugging, but it’s a different kind of debugging from normal software debugging. Normal debugging is usually about the logic of the code, but deployment debugging is usually about the interaction between the code and its environment. Even when the root cause is a logic bug, the fact that the software apparently worked on another machine means that the environment is usually involved somehow.

So, instead of using normal debugging tools like gdb, I have another toolset for debugging deployments. My favourite tool for “Why isn’t this software working on this machine?” is strace.

Analysing D Code with KLEE

Published 28 May 2019

Tags: dlang , Reliability , Security , Tools and Computer Science

KLEE is a symbolic execution engine that can rigorously verify or find bugs in software. It’s designed for C and C++, but it’s just an interpreter for LLVM bitcode combined with theorem prover backends, so it can work with bitcode generated by ldc2. One catch is that it needs a compatible bitcode port of the D runtime to run normal D code. I’m still interested in getting KLEE to work with normal D code, but for now I’ve done some experiments with -betterC D.

Hello World Marketing (or, How I Find Good, Boring Software)

Published 19 March 2019

Tags: Sociotechnology , Tools , Servers , Software Engineering and Reliability

Back in 2001 Joel Spolsky wrote his classic essay “Good Software Takes Ten Years. Get Used To it”. Nothing much has changed since then: software is still taking around a decade of development to get good, and the industry is still getting used to that fact. Unfortunately, the industry has investors who want to see hockey stick growth rates on software that’s a year old or less. The result is an antipattern I like to call “Hello World Marketing”. Once you start to notice it, you see it everywhere, and it’s a huge red flag when choosing software tools.

Some Presentation Slides

Published 16 February 2019

Tags: Careers , Business , Ops , Tools , Reliability and Me

Here are the slide decks to a couple of talks I’ve given recently.

Being Self-Employed in Australia (at JAIT)

Because this talk is based on my own experiences, it’s particularly relevant to service businesses in Australia. But if you’re interested in being your own boss, anywhere or anyhow, you could find it useful. As I said in the talk, there’s a lot of stuff that feels obvious to me now, but I ended up learning the hard way.

Introduction to Infrastructure as Code (at RORO Sydney)

Here’s a common story: Devs write an app, and do all the right things like using source control and writing automated test suites. Then it comes to deploy the code, and they have to figure out all these things like DNS and server infrastructure. They hack something together using web UIs, but six months later no one can remember the deployment process any more.

This presentation was a really quick introduction to the tools you can use to get more app dependencies into source control.

Unfortunately, Garbage Collection isn't Enough

Published 05 December 2018

Tags: Software Engineering , Programming Languages , Systems Design , Reliability , Ops and Anecdotes

Translations:русский

Here’s a little story of some mysterious server failures I had to debug a year ago. The servers would run okay for a while, then eventually start crashing. After that, trying to run practically anything on the machines failed with “No space left on device” errors, but the filesystem only reported a few gigabytes of files on the ~20GB disks.

If You Need Lifeboats, That Means Your Ship is Sinking

Published 25 April 2017

Tags: Reliability , Ops and Software Engineering

It’s 1912 and Captain Edward Smith is boarding the RMS Titanic. He sees the lifeboats on deck and shakes his head with a heavy sigh before turning to the crew. “In my experience, I’ve never needed lifeboats. They’re not best practices — if you need lifeboats, that means your ship is sinking!” The crew members are enlightened and eagerly throw all lifeboats overboard. The Titanic begins its voyage to New York.

The Art of Machinery

How Real-World Apps Lose Data

Robust and Race-free Server Logging using Named Pipes

PSA: Please Just Close Your Files

Debugging Software Deployments with strace

Analysing D Code with KLEE

Hello World Marketing (or, How I Find Good, Boring Software)

Some Presentation Slides

Being Self-Employed in Australia (at JAIT)

Introduction to Infrastructure as Code (at RORO Sydney)

Unfortunately, Garbage Collection isn't Enough

If You Need Lifeboats, That Means Your Ship is Sinking

Why Defensive Coding Matters - A War Story