Story time.

This is a true story, but for the sake of discretion, I’ll pretend it’s about a web service that delivers cat pictures. I’ve drawn an architecture diagram below. User requests come into the Cats as a Service frontend, which is basically responsible for cramming as many cats onto a webpage as possible. A single user request fans out to many requests to the Cat-o-matic backend, each of which fans out to many requests to the cat picture database. The database is just a standard third-party database storing cat pictures and metadata. The Cat-o-matic gets requests like “give me three siamese cats in a rose garden at 480x620px” and delivers responses by doing a whole lot of processing of cat picture data from the database.

Cats as a Service architecture diagram
In reality, each box in the diagram is a cluster of servers running behind load balancers. There are also caches, monitoring servers, and so on that haven’t been drawn in.
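To make the fan-out concrete, here’s a minimal sketch of what a frontend handler might look like. The endpoint URL, request fields, and pool size are all invented for illustration; the real services obviously looked nothing like four lines of Python.

```python
import concurrent.futures

import requests  # third-party HTTP client

CAT_O_MATIC_URL = "http://cat-o-matic.internal/render"  # hypothetical internal endpoint


def handle_page_request(page_spec):
    """One user request to the CaaS frontend fans out into many Cat-o-matic requests."""
    # Each tile on the page becomes its own backend request.
    backend_requests = [
        {"breed": tile["breed"], "scene": tile["scene"],
         "width": tile["width"], "height": tile["height"]}
        for tile in page_spec["tiles"]
    ]
    # Fire them off concurrently; each of these in turn fans out into many
    # database reads inside the Cat-o-matic.
    with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
        return list(pool.map(
            lambda req: requests.post(CAT_O_MATIC_URL, json=req, timeout=5),
            backend_requests,
        ))
```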

Now, anyone who’s developed any serious web service knows that a lot of code ends up being sanity checking. No one needs a one billion by one billion pixel picture of a cat’s nose, but without sanity checking it’s only a matter of time before a user (maliciously or by accident) makes that request and crashes a server.
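The sanity checking itself is rarely sophisticated. Something like the sketch below would have been enough on the Cat-o-matic side; the limits and field names are made up.

```python
MAX_DIMENSION_PX = 4096      # assumed limit: nobody needs a billion-pixel cat nose
MAX_CATS_PER_REQUEST = 50    # assumed limit


class BadRequest(Exception):
    """The request should be rejected outright rather than processed."""


def check_cat_request(req):
    width = req.get("width")
    height = req.get("height")
    count = req.get("count", 1)
    if not isinstance(width, int) or not isinstance(height, int):
        raise BadRequest("width and height must be integers")
    if not (0 < width <= MAX_DIMENSION_PX and 0 < height <= MAX_DIMENSION_PX):
        raise BadRequest(f"dimensions must be between 1 and {MAX_DIMENSION_PX}px")
    if not isinstance(count, int) or not (0 < count <= MAX_CATS_PER_REQUEST):
        raise BadRequest(f"count must be between 1 and {MAX_CATS_PER_REQUEST}")
    return req
```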

Sanity checking requests wasn’t the biggest priority for the team working on the Cat-o-matic, though. Their priority was implementing the increasingly complex cat-pic-processing features the frontend team needed. Besides, it wasn’t as if the Cat-o-matic were on the open internet; all requests to the Cat-o-matic came from the CaaS, which was developed by intelligent engineers who wouldn’t make stupid requests.

On the other hand, although the CaaS had sanity checking of user requests, it turned out it had no sanity checking of the backend requests it produced. Why would it? The Cat-o-matic was written by intelligent engineers who knew what they were doing.
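The missing piece was the mirror image of the check above: validating requests on the way out, before they ever reach the backend. Reusing the hypothetical check_cat_request and CAT_O_MATIC_URL from the earlier sketches, it could have been as little as:

```python
import requests  # third-party HTTP client


def send_backend_request(req):
    """Dispatch a request the CaaS has derived from a user request."""
    # Validate the *derived* request too: the CaaS's own processing can
    # produce dimensions and counts that no user ever asked for.
    check_cat_request(req)
    return requests.post(CAT_O_MATIC_URL, json=req, timeout=5)
```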

It should be obvious how things went wrong. The ironic twist is that it wasn’t real user requests that caused the outage, but fake user requests from the monitoring stack. A new feature was added to the CaaS, and an innocent-looking test case was added to the black-box monitoring.

So, once every few minutes, the monitoring stack would check that the new CaaS feature was working by sending it a request. The CaaS would take the request, process it a bit, and then turn around to rain an unholy armageddon of insane requests on the Cat-o-matic servers. None of the victim servers survived to report back, so eventually the CaaS would give up waiting and deliver an error page to the monitoring stack. The monitoring stack would record the failure and then simply try again a few minutes later – after all the servers that had crashed in the previous round had finished restarting.
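For what it’s worth, a black-box check like that is about as simple as it sounds. A rough sketch, with an invented probe URL and cadence:

```python
import logging
import time

import requests  # third-party HTTP client

PROBE_URL = "http://caas.internal/new-feature-page"  # hypothetical endpoint for the new feature
PROBE_INTERVAL_S = 300                               # "once every few minutes"


def probe_forever():
    while True:
        try:
            ok = requests.get(PROBE_URL, timeout=30).status_code == 200
        except requests.RequestException:
            ok = False
        logging.info("CaaS probe %s", "succeeded" if ok else "failed")
        # By the time the next probe fires, the backends crashed by the
        # previous round have finished restarting.
        time.sleep(PROBE_INTERVAL_S)
```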

Postmortem

The incident itself was exciting (although it turned out enough servers were up at any time to keep serving end users), but the fix was mundane after the cause was identified: roll back the new monitoring config, and then do a review of the codebase looking for places to add more checking.

But there were some good morals to the story. First, it was a gentle reminder that any production change can cause a system failure. The vulnerable code had all been pushed under the usual safety checks, but the crashes were triggered only when the new monitoring config was loaded.

Second, the feeling that deep processing pipelines are a defence against weird requests is an illusion. The truth is the opposite: processing tends to create weird requests.

Third, communication issues were obviously a major factor. If the engineers who pushed the monitoring config had had immediate feedback that the backend servers were crashing, they would have rolled it back straight away. Instead, they were tracing their monitoring check failures downstream while the backend engineers were tracing their server crashes upstream. Of course, the outage wouldn’t have happened at all if the two teams had figured out who was responsible for sanity checking what in the first place. Some of these lines of communication could have been improved (and they were afterwards), but ultimately communication failures are going to start happening once a team grows beyond a size of one. Projects need to be made robust to that.

Which brings me to the main moral of the story: defensive coding needs to be the culture. I choose the word “culture” deliberately because the point is that in practice you’ll either have a culture of defensive coding or a culture of defenceless coding. Relying on everyone knowing exactly where in the system assumptions need to be verified and where they don’t just doesn’t scale. If you can’t look at a single component and see what assumptions are made there and how they are verified, the system is probably broken.