Robust and Race-free Server Logging using Named Pipes


Tags: , , , and

If you do any server administration work, you’ll have worked with log files. And if your servers need to be reliable, you’ll know that log files are common source of problems, especially when you need to rotate or ship them (which is practically always). In particular, moving files around causes race conditions.

Thankfully, there are better ways. With named pipes, you can have a simple and robust logging stack, with no race conditions, and without patching your servers to support some network logging protocol.

Scaling a GraphQL Website

Published , Updated

Tags: , , and

I normally write abstractly about work I’ve done for other people (for obvious reasons), but I’ve been given permission to write about a website, Vocal, that I did some SRE work on last year. I actually gave a presentation at GraphQL Sydney back in February, but this blog post got delayed a bit.

Vocal is a GraphQL-based website that got traction and hit scaling problems that I got called in to fix. Here’s what I did. Obviously, you’ll find this post useful if you’re scaling another GraphQL website, but most of it’s representative of what you have to deal with when a site first gets enough traffic to cause technical problems. If website scalability is a key interest of yours, you might want to read my recent post about scalability first.

What is a High Traffic Website?


Tags: , and

Terms like “high traffic” are hazardous when designing online services because salespeople, business analysts and engineers all have different perspectives about what they mean. If we’re talking about, say, a high-stakes online poker room, then “high traffic” for the business side will be very low compared to what it is for the technical side. However, all these people will be in a meeting room together making decisions, using the same words to mean different things. It’s obvious how that can lead to bad (and sometimes expensive) choices.

A lot of my day job is talking to business stakeholders and figuring out the technical solutions they need, so this is a problem I have to deal with. So I’ve got my own purely technical way to think about traffic levels for online services.

Some Useful Probability Facts for Systems Programming


Tags: , and

Probability problems come up a lot in systems programming, and I’m using that term loosely to mean everything from operating systems programming and networking, to building large online services, to creating virtual worlds like in games. Here’s a bunch of rough-and-ready probability rules of thumb that are deeply related and have many practical applications when designing systems.

The Enterprise Content Management System


Tags: , , and

A few years ago I worked on the version 2 of some big enterprise’s internal website. A smaller company had the contract, and I’d been subcontracted to deal with deployment and any serverside/backend changes.

The enterprise side had a committee to figure out lists of requirements. Committees are famously bad at coming up with simple and clear specs, and prone to bikeshedding. Thankfully, the company I was contracting with had a project manager who had the job of engaging with the committee for hours each day so that the rest of us didn’t have to. However, we still got a constant stream of inane change requests. (One particular feature of the site changed name three times in about two months.)

It was pretty obvious early on what was happening, so I integrated the existing website backend with a content management system (CMS) that had an admin panel with a friendly WYSIWYG editor. New features got implemented as plugins to the CMS, and old features got migrated as needed. We couldn’t make everything customisable, but eventually we managed to push back on several change requests by saying, “You can customise that whenever you want through the admin panel.”

So, we got things done to satisfaction and delivered, but there was one complication: using the admin panel and WYSIWYG editor. The committee members wouldn’t use it because they were ideas people and didn’t implement anything. The company had IT staff who managed things like websites, but they were hired as technical staff, not for editing website content. On the other hand, they had staff hired for writing copy, but they weren’t hired as website administrators.

So here’s how they ended up using the CMS: CMS data would get rendered as HTML by the website backend, which would then be exported to PDF documents by IT staff. The PDF documents would be converted to Word documents and sent to the writers via email. The writers would edit the documents and send them back to the IT staff, who would do a side-by-side comparison with the originals and then manually enter the changes through the graphical editor in the admin panel. All of the stakeholders were delighted to have a shiny version 2 of the website that had a bunch of new features, was highly customisable, integrated well with their existing processes and was all within budget.

Nowadays, when I’m designing something and I think it’s obvious how it will be used, I remind myself about that CMS and its user-friendly, graphical editor.

Object-Oriented Programming and Essential State


Tags: and


Back in 2015, Brian Will wrote a provocative blog post: Object-Oriented Programming: A Disaster Story. He followed it up with a video called Object-Oriented Programming is Bad, which is much more detailed. I recommend taking the time to watch the video, but here’s my one-paragraph summary:

The Platonic ideal of OOP is a sea of decoupled objects that send stateless messages to one another. No one really makes software like that, and Brian points out that it doesn’t even make sense: objects need to know which other objects to send messages to, and that means they need to hold references to one another. Most of the video is about the pain that happens trying to couple objects for control flow, while pretending that they’re decoupled by design.

Overall his ideas resonate with my own experiences of OOP: objects can be okay, but I’ve just never been satisfied with object-orientation for modelling a program’s control flow, and trying to make code “properly” object-oriented always seems to create layers of unneccessary complexity.

There’s one thing I don’t think he explains fully. He says outright that “encapsulation does not work”, but follows it with the footnote “at fine-grained levels of code”, and goes on to acknowledge that objects can sometimes work, and that encapsulation can be okay at the level of, say, a library or file. But he doesn’t explain exactly why it sometimes works and sometimes doesn’t, and how/where to draw the line. Some people might say that makes his “OOP is bad” claim flawed, but I think his point stands, and that the line can be drawn between essential state and accidental state.

Data Still Dominates


Tags: , and

Translations: русский

Here’s a quote from Linus Torvalds in 2006:

I’m a huge proponent of designing your code around the data, rather than the other way around, and I think it’s one of the reasons git has been fairly successful… I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships.

Which sounds a lot like Eric Raymond’s “Rule of Representation” from 2003:

Fold knowledge into data, so program logic can be stupid and robust.

Which was just his summary of ideas like this one from Rob Pike in 1989:

Data dominates. If you’ve chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.

Which cites Fred Brooks from 1975:

Representation is the Essence of Programming

Beyond craftmanship lies invention, and it is here that lean, spare, fast programs are born. Almost always these are the result of strategic breakthrough rather than tactical cleverness. Sometimes the strategic breakthrough will be a new algorithm, such as the Cooley-Tukey Fast Fourier Transform or the substitution of an n log n sort for an n2 set of comparisons.

Much more often, strategic breakthrough will come from redoing the representation of the data or tables. This is where the heart of your program lies. Show me your flowcharts and conceal your tables, and I shall be continued to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.

So, smart people have been saying this again and again for nearly half a century: focus on the data first. But sometimes it feels like the most famous piece of smart programming advice that everyone forgets.

Let me give some real examples.

Unfortunately, Garbage Collection isn't Enough


Here’s a little story of some mysterious server failures I had to debug a year ago. The servers would run okay for a while, then eventually start crashing. After that, trying to run practically anything on the machines failed with “No space left on device” errors, but the filesystem only reported a few gigabytes of files on the ~20GB disks.

Relational Databases Considered Incredibly Useful


Tags: , , , and

A year ago I worked on a web service that had Postgres and Elasticsearch as backends. Postgres was doing most of the work and was the primary source of truth about all data, but some documents were replicated in Elasticsearch for querying. Elasticsearch was easy to get started with, but had an ongoing maintenance cost: it was one more moving part to break down, it occasionally went out of sync with the main database, it was another thing for new developers to install, and it added complexity to the deployment, as well as the integration tests. But most of the features of Elasticsearch weren’t needed because the documents were semi-structured, and the search queries were heavily keyword-based. Dropping Elasticsearch and just using Postgres turned out to work okay. No, I’m not talking about brute-force string matching using LIKE expressions (as implemented in certain popular CMSs); I’m talking about using the featureful text search indexes in good modern databases. Text search with Postgres took more work to implement, and couldn’t do all the things Elasticsearch could, but it was easier to deploy, and since then it’s been zero maintenance. Overall, it’s considered a net win (I talked to some of the developers again just recently).

Compression, Complexity and Software System Design


Tags: , and

Information theory gives us a way to understand communication systems, by giving us a way to understand what happens to information as it’s transmitted or re-encoded.

We can also study what happens to complexity in a software system as components depend on or interface each other. Just like we can make rigorous arguments about how much information can be compressed, we can make arguments about how much complexity can be simplified, and use this to make better choices when designing software systems.