Robust and Race-free Server Logging using Named Pipes

Published

Tags: , , , and

If you do any server administration work, you’ll have worked with log files. And if your servers need to be reliable, you’ll know that log files are common source of problems, especially when you need to rotate or ship them (which is practically always). In particular, moving files around causes race conditions.

Thankfully, there are better ways. With named pipes, you can have a simple and robust logging stack, with no race conditions, and without patching your servers to support some network logging protocol.

PSA: Please Just Close Your Files

Published

Tags: and

An article called “Contrarian view on closing files” is on the front page of Hacker News right now with over 60 upvotes. It starts off with a quote from the Google Python style guide:

Explicitly close files and sockets when done with them. Leaving files, sockets or other file-like objects open unnecessarily has many downsides […]

The article’s main complaint is that “this advice is applying a notably higher standard of premature optimization to file descriptors than to any other kind of resource”. It complains that it’s “depressingly commonplace” to see code like this:

with open("README.md") as readme:
    long_description = readme.read()

Sure, if it’s a one-off read of a doc file, you can almost certainly get away with just open("README.md").read(), but I honestly have no idea what’s depressing about code that just works reliably.

Leaving files and sockets open is something that you can usually get away with, until weird stuff happens. And, no, it’s not just about running out of file descriptors (although that does happen, too). I’ve previously written about servers that mysteriously ran out of disk space because (spoiler) they were deleting files that hadn’t been closed properly. Regardless of how awesome your computer and network equipment are, the number of TCP connections you can make to a given host and port are limited by a 16 bit integer (and practically always limited more by your network settings), then you get network failures. Writing to files that you don’t explicitly close (or flush) is especially dicey — the write might actually happen straight away, then on another day or another environment it might get delayed.

Sure, these failure modes aren’t very common, but they definitely happen. I can think of three examples right now from the past few years of my day job. Closing files as a habit is easier than figuring out when it’ll go wrong.

The article has an example of a function that lazily loads JSON given a file path string. Sure, it doesn’t work properly if you close the file inside the function, but I’d say the problem is in the API design: the file handle resource straddles the interface.

The first good alternative is to keep the file handle completely inside: take a path argument, load the data eagerly, close the file and return the parsed data. The other good alternative is to keep the file handle completely outside: take a file handle as argument and return it wrapped in a lazy JSON parser. Either way makes it easier to see how and where the file should be closed.

The Google advice is pretty solid: when you’re done with a file or socket, just close it. I’ll add: sure, maybe it won’t always be obvious when you’re done with a handle, but perhaps that code design is making life more “exciting” than necessary. Production failures are more depressing than file closing code ever will be.

Debugging Software Deployments with strace

Published

Tags: , , , and

Translations:русский

Most of my paid work involves deploying software systems, which means I spend a lot of time trying to answer the following questions:

That’s a kind of debugging, but it’s a different kind of debugging from normal software debugging. Normal debugging is usually about the logic of the code, but deployment debugging is usually about the interaction between the code and its environment. Even when the root cause is a logic bug, the fact that the software apparently worked on another machine means that the environment is usually involved somehow.

So, instead of using normal debugging tools like gdb, I have another toolset for debugging deployments. My favourite tool for “Why isn’t this software working on this machine?” is strace.

Analysing D Code with KLEE

Published

Tags: , , , and

KLEE is a symbolic execution engine that can rigorously verify or find bugs in software. It’s designed for C and C++, but it’s just an interpreter for LLVM bitcode combined with theorem prover backends, so it can work with bitcode generated by ldc2. One catch is that it needs a compatible bitcode port of the D runtime to run normal D code. I’m still interested in getting KLEE to work with normal D code, but for now I’ve done some experiments with -betterC D.

Hello World Marketing (or, How I Find Good, Boring Software)

Published

Tags: , , , and

Back in 2001 Joel Spolsky wrote his classic essay “Good Software Takes Ten Years. Get Used To it”. Nothing much has changed since then: software is still taking around a decade of development to get good, and the industry is still getting used to that fact. Unfortunately, the industry has investors who want to see hockey stick growth rates on software that’s a year old or less. The result is an antipattern I like to call “Hello World Marketing”. Once you start to notice it, you see it everywhere, and it’s a huge red flag when choosing software tools.

Some Presentation Slides

Published

Tags: , , , , and

Here are the slide decks to a couple of talks I’ve given recently.

Being Self-Employed in Australia (at JAIT)

Because this talk is based on my own experiences, it’s particularly relevant to service businesses in Australia. But if you’re interested in being your own boss, anywhere or anyhow, you could find it useful. As I said in the talk, there’s a lot of stuff that feels obvious to me now, but I ended up learning the hard way.

Introduction to Infrastructure as Code (at RORO Sydney)

Here’s a common story: Devs write an app, and do all the right things like using source control and writing automated test suites. Then it comes to deploy the code, and they have to figure out all these things like DNS and server infrastructure. They hack something together using web UIs, but six months later no one can remember the deployment process any more.

This presentation was a really quick introduction to the tools you can use to get more app dependencies into source control.

Unfortunately, Garbage Collection isn't Enough

Translations:русский

Here’s a little story of some mysterious server failures I had to debug a year ago. The servers would run okay for a while, then eventually start crashing. After that, trying to run practically anything on the machines failed with “No space left on device” errors, but the filesystem only reported a few gigabytes of files on the ~20GB disks.

If You Need Lifeboats, That Means Your Ship is Sinking

Published

Tags: , and

It’s 1912 and Captain Edward Smith is boarding the RMS Titanic. He sees the lifeboats on deck and shakes his head with a heavy sigh before turning to the crew. “In my experience, I’ve never needed lifeboats. They’re not best practices — if you need lifeboats, that means your ship is sinking!” The crew members are enlightened and eagerly throw all lifeboats overboard. The Titanic begins its voyage to New York.

Why Defensive Coding Matters - A War Story

Published

Tags: , , and

Story time.