A great thing about modern app development is that there are cloud providers to worry about things like hardware
failures or how to set up RAID. Decent cloud providers are extremely unlikely to lose your app’s data, so sometimes I
get asked what backups are really for these days. Here are some real-world stories that show exactly what.
If you do any server administration work, you’ll have worked with log files. And if your servers need to be
reliable, you’ll know that log files are a common source of problems, especially when you need to rotate or ship them
(which is practically always). In particular, moving files around causes race conditions.
Thankfully, there are better ways. With named pipes, you can have a simple and robust logging stack, with no race
conditions, and without patching your servers to support some network logging protocol.
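As a rough illustration of the idea, here is a minimal Python sketch, assuming a POSIX system. The server writes to a named pipe exactly as it would to a regular log file, and a separate reader consumes the lines as they arrive, so there is no file to move or rotate. The path `app.log`, the function names, and the use of a thread to stand in for a separate log-shipping process are all assumptions made for the example, not part of any particular logging stack.

```python
import os
import tempfile
import threading


def read_log_lines(fifo_path, sink):
    """Read lines from the named pipe until every writer has closed it."""
    with open(fifo_path, "r") as fifo:
        for line in fifo:
            sink.append(line.rstrip("\n"))


def demo():
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "app.log")  # hypothetical log path
    os.mkfifo(fifo_path)  # POSIX only

    collected = []
    # A thread stands in for what would normally be a separate reader process.
    reader = threading.Thread(target=read_log_lines, args=(fifo_path, collected))
    reader.start()

    # The "server" side opens the pipe as if it were an ordinary log file.
    # Opening blocks until the reader has opened the other end.
    with open(fifo_path, "w") as log:
        log.write("request handled\n")
        log.write("request failed\n")

    # Closing the write end delivers end-of-file to the reader.
    reader.join()
    os.remove(fifo_path)
    os.rmdir(workdir)
    return collected
```

The server needs no special support beyond being able to write to a path, which is why this approach avoids patching it for a network logging protocol.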
Explicitly close files and sockets when done with them. Leaving files, sockets or other file-like objects open
unnecessarily has many downsides […]
The article’s main complaint is that “this advice is applying a notably higher standard of premature optimization to
file descriptors than to any other kind of resource”. It complains that it’s “depressingly commonplace” to see code
like this:
Sure, if it’s a one-off read of a doc file, you can almost certainly get away with just open("README.md").read(), but I honestly have no idea what’s depressing
about code that just works reliably.
Leaving files and sockets open is something that you can usually get away with, until weird stuff happens. And, no,
it’s not just about running out of file descriptors (although that does happen, too). I’ve previously written about
servers that mysteriously ran out of disk space because (spoiler) they were
deleting files that hadn’t been closed properly. Regardless of how awesome your computer and network equipment are,
the number of TCP connections you can make to a given host and port is limited by a 16-bit integer (and practically
always limited further by your network settings); once you hit that limit, you get network failures. Writing to files
that you don’t explicitly close (or flush) is especially dicey: the write might actually happen straight away, but on
another day or in another environment it might get delayed.
Sure, these failure modes aren’t very common, but they definitely happen. I can think of three examples right now
from the past few years of my day job. Closing files as a habit is easier than figuring out when it’ll go wrong.
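In Python, the habit costs one line. A minimal sketch, with a hypothetical helper name: the `with` block flushes and closes the file when it exits, including when an exception is raised partway through the writes.

```python
def write_lines(path, lines):
    """Write each line to path, closing the file even if a write fails."""
    with open(path, "w") as f:
        for line in lines:
            f.write(line + "\n")
    # Leaving the with block has flushed buffered data and closed the handle.
    return f.closed
```

Because the data is flushed at a known point, a reader that opens the file afterwards sees the complete contents rather than whatever happened to be written out so far.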
The article has an example of a function that lazily loads JSON given a file path string. Sure, it doesn’t work
properly if you close the file inside the function, but I’d say the problem is in the API design: the file handle
resource straddles the interface.
The first good alternative is to keep the file handle completely inside: take a path argument, load the data
eagerly, close the file and return the parsed data. The other good alternative is to keep the file handle completely
outside: take a file handle as argument and return it wrapped in a lazy JSON parser. Either way makes it easier to see
how and where the file should be closed.
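Here is a minimal sketch of the two designs. The function names are hypothetical, and the lazy variant assumes one JSON document per line, since the standard `json` module has no lazy parser of its own.

```python
import io
import json


def load_json(path):
    """Handle stays inside: open, parse eagerly, close, return plain data."""
    with open(path) as f:
        return json.load(f)


def iter_json_lines(handle):
    """Handle stays outside: the caller opens and closes it.

    Lazily parses one JSON document per non-blank line (an assumption
    made for this sketch).
    """
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def lazy_usage_example():
    # The caller owns the handle, so its lifetime is visible in one place.
    with io.StringIO('{"x": 1}\n{"x": 2}\n') as handle:
        return [record["x"] for record in iter_json_lines(handle)]
```

In both versions a single piece of code is responsible for the handle, so there is never a question of who should close it.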
The Google advice is pretty solid: when you’re done with a file or socket, just close it. I’ll add: sure, maybe it
won’t always be obvious when you’re done with a handle, but perhaps that code design is making life more “exciting”
than necessary. Production failures are more depressing than file closing code ever will be.
Most of my paid work involves deploying software systems, which means I spend a lot of time trying to answer the
following questions:
This software works on the original developer’s machine, so why doesn’t it work on mine?
This software worked on my machine yesterday, so why doesn’t it work today?
That’s a kind of debugging, but it’s a different kind of debugging from normal software debugging. Normal debugging
is usually about the logic of the code, but deployment debugging is usually about the interaction between the code and
its environment. Even when the root cause is a logic bug, the fact that the software apparently worked on another
machine means that the environment is usually involved somehow.
So, instead of using normal debugging tools like gdb, I
have another toolset for debugging deployments. My favourite tool for “Why isn’t this software working on this
machine?” is strace.
KLEE is a symbolic execution engine that can rigorously verify or find bugs in
software. It’s designed for C and C++, but it’s just an interpreter for LLVM bitcode combined with theorem prover
backends, so it can work with bitcode generated by ldc2. One
catch is that it needs a compatible bitcode port of the D runtime to run normal D code. I’m still interested in getting
KLEE to work with normal D code, but for now I’ve done some experiments with -betterC D.
Back in 2001 Joel Spolsky wrote his classic essay “Good Software Takes Ten
Years. Get Used To It”. Nothing much has changed since then: software is still taking around a decade of
development to get good, and the industry is still getting used to that fact. Unfortunately, the industry has investors
who want to see hockey stick growth rates on software that’s a year old or less. The result is an antipattern I like to
call “Hello World Marketing”. Once you start to notice it, you see it everywhere, and it’s a huge red flag when
choosing software tools.
Because this talk is based on my own experiences, it’s particularly relevant to service businesses in Australia. But
if you’re interested in being your own boss, anywhere or anyhow, you could find it useful. As I said in the talk,
there’s a lot of stuff that feels obvious to me now, but I ended up learning the hard way.
Here’s a common story: Devs write an app, and do all the right things like using source control and writing
automated test suites. Then it comes time to deploy the code, and they have to figure out all these things like DNS and
server infrastructure. They hack something together using web UIs, but six months later no one can remember the
deployment process any more.
This presentation was a really quick introduction to the tools you can use to get more app dependencies into source
control.
Here’s a little story of some mysterious server failures I had to debug a year ago. The servers would run okay for a
while, then eventually start crashing. After that, trying to run practically anything on the machines failed with “No
space left on device” errors, but the filesystem only reported a few gigabytes of files on the ~20GB disks.
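One mechanism that can produce this kind of mismatch on POSIX systems is a file that has been deleted while a process still holds it open: the name disappears from directory listings, but the data stays on disk until the last handle is closed. A minimal sketch, with a hypothetical file name, and offered as an illustration of the mechanism rather than a reproduction of these particular servers:

```python
import os
import tempfile


def deleted_but_open_demo():
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "big.log")  # hypothetical file name

    f = open(path, "w+")
    f.write("x" * 1024)
    f.flush()

    # Remove the name while the handle is still open (POSIX behaviour;
    # this call fails on Windows).
    os.remove(path)
    name_exists = os.path.exists(path)

    # The open handle still refers to the data, which still occupies space.
    info = os.fstat(f.fileno())
    f.seek(0)
    readable_chars = len(f.read())

    f.close()  # only now can the filesystem reclaim the space
    os.rmdir(workdir)
    # st_nlink is the link count, typically 0 once the last name is gone.
    return name_exists, info.st_nlink, info.st_size, readable_chars
```

Tools that add up file sizes by walking directories cannot see such a file, while the filesystem's free-space accounting still counts it.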
It’s 1912 and Captain Edward Smith is boarding the RMS Titanic. He sees the lifeboats on deck and shakes his head
with a heavy sigh before turning to the crew. “In my experience, I’ve never needed lifeboats. They’re not best
practices — if you need lifeboats, that means your ship is sinking!” The crew members are enlightened and
eagerly throw all lifeboats overboard. The Titanic begins its voyage to New York.