Robust and Race-free Server Logging using Named Pipes

Published 10 October 2020

If you do any server administration work, you’ll have worked with log files. And if your servers need to be reliable, you’ll know that log files are a common source of problems, especially when you need to rotate or ship them (which is practically always). In particular, moving files around causes race conditions.

Thankfully, there are better ways. With named pipes, you can have a simple and robust logging stack, with no race conditions, and without patching your servers to support some network logging protocol.
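
As a rough illustration of the idea (not necessarily the setup the full post describes), here is a minimal Python sketch, assuming a Unix-like system where `os.mkfifo` is available. The path `server.log`, the sample log lines and the function names are made up for the example. The "server" writes to what looks like an ordinary log file, while a separate reader consumes each line straight from the pipe, so there is no file on disk to move or rotate.

```python
import os
import tempfile
import threading


def collect_from_fifo(fifo_path, sink):
    """Read lines from the named pipe until every writer has closed it."""
    # open() blocks here until a writer opens the other end of the pipe.
    with open(fifo_path, "r") as pipe:
        for line in pipe:
            sink.append(line.rstrip("\n"))


def demo():
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "server.log")  # hypothetical log path
    os.mkfifo(fifo_path)  # Unix only

    received = []
    reader = threading.Thread(target=collect_from_fifo, args=(fifo_path, received))
    reader.start()

    # The "server" opens the pipe exactly as it would open a plain log file.
    with open(fifo_path, "w") as log:
        log.write("GET /index.html 200\n")
        log.write("GET /missing 404\n")

    reader.join()  # closing the writer delivers EOF to the reader
    os.remove(fifo_path)
    os.rmdir(workdir)
    return received


if __name__ == "__main__":
    for entry in demo():
        print(entry)
```

In a real deployment the reader would be a long-running log shipper rather than a thread in the same process; the thread is only there to keep the sketch self-contained.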

Scaling a GraphQL Website

Published 29 June 2020, Updated 09 December 2020

Tags: Performance, Systems Design, Servers and Me

I normally write abstractly about work I’ve done for other people (for obvious reasons), but I’ve been given permission to write about a website, Vocal, that I did some SRE work on last year. I actually gave a presentation at GraphQL Sydney back in February, but this blog post got delayed a bit.

Vocal is a GraphQL-based website that got traction and hit scaling problems that I got called in to fix. Here’s what I did. Obviously, you’ll find this post useful if you’re scaling another GraphQL website, but most of it’s representative of what you have to deal with when a site first gets enough traffic to cause technical problems. If website scalability is a key interest of yours, you might want to read my recent post about scalability first.

What is a High Traffic Website?

Published 21 April 2020

Tags: Systems Design, Servers and Performance

Terms like “high traffic” are hazardous when designing online services because salespeople, business analysts and engineers all have different perspectives about what they mean. If we’re talking about, say, a high-stakes online poker room, then “high traffic” for the business side will be very low compared to what it is for the technical side. However, all these people will be in a meeting room together making decisions, using the same words to mean different things. It’s obvious how that can lead to bad (and sometimes expensive) choices.

A lot of my day job is talking to business stakeholders and figuring out the technical solutions they need, so this is a problem I have to deal with. So I’ve got my own purely technical way to think about traffic levels for online services.

Debugging Software Deployments with strace

Published 14 November 2019

Tags: Ops, Tools, Reliability, Servers and Low Level

Translations: русский

Most of my paid work involves deploying software systems, which means I spend a lot of time trying to answer the following questions:

This software works on the original developer’s machine, so why doesn’t it work on mine?
This software worked on my machine yesterday, so why doesn’t it work today?

That’s a kind of debugging, but it’s a different kind of debugging from normal software debugging. Normal debugging is usually about the logic of the code, but deployment debugging is usually about the interaction between the code and its environment. Even when the root cause is a logic bug, the fact that the software apparently worked on another machine means that the environment is usually involved somehow.

So, instead of using normal debugging tools like gdb, I have another toolset for debugging deployments. My favourite tool for “Why isn’t this software working on this machine?” is strace.

Hello World Marketing (or, How I Find Good, Boring Software)

Published 19 March 2019

Tags: Sociotechnology, Tools, Servers, Software Engineering and Reliability

Back in 2001 Joel Spolsky wrote his classic essay “Good Software Takes Ten Years. Get Used To It”. Nothing much has changed since then: software is still taking around a decade of development to get good, and the industry is still getting used to that fact. Unfortunately, the industry has investors who want to see hockey stick growth rates on software that’s a year old or less. The result is an antipattern I like to call “Hello World Marketing”. Once you start to notice it, you see it everywhere, and it’s a huge red flag when choosing software tools.

Popular Web Servers Compared

Published 07 January 2018

Tags: Ops, Servers and Tools

Here’s a comparison of the web servers I’ve used the most.

Relational Databases Considered Incredibly Useful

Published 28 October 2017

Tags: Servers, Software Engineering, Systems Design, SQL and Tools

A year ago I worked on a web service that had Postgres and Elasticsearch as backends. Postgres was doing most of the work and was the primary source of truth about all data, but some documents were replicated in Elasticsearch for querying. Elasticsearch was easy to get started with, but had an ongoing maintenance cost: it was one more moving part to break down, it occasionally went out of sync with the main database, it was another thing for new developers to install, and it added complexity to the deployment, as well as the integration tests. But most of the features of Elasticsearch weren’t needed because the documents were semi-structured, and the search queries were heavily keyword-based. Dropping Elasticsearch and just using Postgres turned out to work okay. No, I’m not talking about brute-force string matching using LIKE expressions (as implemented in certain popular CMSs); I’m talking about using the featureful text search indexes in good modern databases. Text search with Postgres took more work to implement, and couldn’t do all the things Elasticsearch could, but it was easier to deploy, and since then it’s been zero maintenance. Overall, it’s considered a net win (I talked to some of the developers again just recently).

Terraform is Best for Configuring Hashicorp Vault

Published 15 July 2017

Tags: Ops, Tools and Servers

Hashicorp Vault is a handy tool for scalable secrets management in a distributed system or team-based project. Unfortunately, the only out-of-the-box way to configure it is through its API (or a UI), but most projects that need Vault will need to manage the configuration in source control.

There’s a workaround explained on the Hashicorp blog. It’s a neat hack, but here’s a quick note about why using Terraform’s Vault integration is a better idea for production use.

A Tale of Three Server Caching Architectures

Published 30 July 2016

Tags: Systems Design, Servers, Performance and Software Engineering

Exactly where you put caching in a distributed system has a significant impact on its effectiveness, in ways that aren’t always obvious during the design phase of development.

Offline Compression with Nginx

Published 06 June 2016

Tags: Ops, Servers and Performance

There’s a clear tradeoff with compressing HTTP responses on the fly: compress “harder” and you’ll (hopefully) get a smaller file that takes less time to send over the network – but the net benefit might be negative if the extra work takes too much time, or (when under heavy load) too much CPU. A lot of work has been done analysing this tradeoff, but for static content there’s a neat and simple way to avoid the tradeoff completely: compress offline before serving. Nginx supports this using the gzip_static module.
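
The offline step can be sketched roughly as follows. This is a minimal Python example, not the tooling the full post uses: the function name `precompress` and the list of extensions are assumptions for illustration. It writes a maximum-compression `.gz` copy next to each matching static file, which is the naming the gzip_static module looks for once `gzip_static on;` is set.

```python
import gzip
import os
import shutil


def precompress(root, extensions=(".html", ".css", ".js")):
    """Write a maximum-compression .gz copy next to each matching file."""
    written = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Skip anything that is not a compressible static asset,
            # including .gz files left over from a previous run.
            if not name.endswith(extensions):
                continue
            source = os.path.join(dirpath, name)
            target = source + ".gz"
            with open(source, "rb") as src, gzip.open(target, "wb", compresslevel=9) as dst:
                shutil.copyfileobj(src, dst)
            # Keep the timestamps of the original and compressed files in step.
            stat = os.stat(source)
            os.utime(target, (stat.st_atime, stat.st_mtime))
            written.append(target)
    return sorted(written)
```

Because the compression happens once, ahead of time, the highest compression level costs nothing at request time.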