Compression, Complexity and Software System Design

Information theory gives us a way to understand communication systems, by giving us a way to understand what happens to information as it’s transmitted or re-encoded.

We can also study what happens to complexity in a software system as components depend on or interface each other. Just like we can make rigorous arguments about how much information can be compressed, we can make arguments about how much complexity can be simplified, and use this to make better choices when designing software systems.

The Limits of Compression

It’s well known in computer science that there’s no free lunch in data compression. Compressing a text file once makes it much smaller; compressing it more times makes it bigger. Compressing random data isn’t helpful at all.

Thanks to the concept of entropy, information theory gives us hard, theoretical limits on how much any given data format can be compressed. English text documents have been experimentally estimated to have a little over 1 bit of entropy per 8-bit character, which suggests that the absolute best possible lossless text compression algorithm could shrink files by under eight times, by cutting out the redundancy. If you want smaller files, you need to accept some loss of information — like the compression used in JPEG images and most VP9 videos.

The counting argument is a famous proof that there’s no compression algorithm that can compress everything. What the argument actually proves is that any encoding scheme that makes some things smaller must also make other things bigger. Real world compression algorithms are carefully designed to make useful things (like media files) shorter, and useless things (like random junk) longer. There really is no free lunch, and any naïve attempt to get one will backfire.

No Silver Bullet

Fred Brooks wrote his classic essay No Silver Bullet in the 1980s. He pointed out that the complexity of software projects can be split into what he called essential complexity and accidental complexity. The end user doesn’t care if the backend systems use JSON or XML, so that’s accidental complexity. But the end user does want the system to integrate with all their favourite other systems, so that’s essential complexity. A new tool or programming paradigm can only reduce the complexity of a project by, say, five times if at least four-fifths of the complexity is accidental complexity that can be eliminated with tools or paradigms. Brooks didn’t think that was true of projects he’d worked on, so he wasn’t expecting a revolution in software development. Improvements, sure, but not a revolution.

If you think of programming as a process of encoding solutions into a computer, Brooks’ essential complexity is like the entropy in information theory, and accidental complexity is like redundancy. (Essential complexity is also related to Kolmogorov complexity, first studied in the 1960s.) To simplify a software project, you have two options: One is to find and remove some accidental complexity. If you can’t do that, you must lose some essential complexity (i.e., drop requirements). Looking at things that way, it should be obvious, but plenty of real-world software projects fail because the people responsible try to invent a third option.

The Inner Platform Effect

Alex Papadimoulis from the The Daily WTF first named the inner platform effect in 2006. It’s an antipattern that appears again and again in big enterprise. A programmer realises that their daily job is taking business requirements from middle managers and turning them into code, and has the bright idea of making a 100% configurable system that anyone can use to implement requirements without programming. It never works out that way. First, instead of just implementing each new requirement in code normally like before, the programmer now has to implement the logic for each requirement in the “rules engine”, and then implement the requirement itself in the custom configuration language. Usually this (significant) extra effort is justified by dreaming of how amazing things will be when the rules engine finally supports all possible logic. But that never works out, either. As more and more logic is added to the rules engine, it looks more and more like a buggy and badly documented re-implementation of the programming environment, without any IDE or tooling support. No one but the original programmer is willing or able to configure it.

It’s inevitable. Once again, the only ways to simplify the programmer’s job are to a) identify and remove accidental complexity in the programming platform (e.g., switching from COBOL to a better language), or, b) sacrifice essential complexity (e.g., limiting the range of supported business requirements). But that’s not the point of the rules engine, so it just piles on complexity.

Inner platforms often have humbler beginnings. I recently came across yet another example in a government department that was building a system that needed to process government-issued IDs. The system handled a wide range of IDs from across the country, so it required a lot of forms for data entry, and building all these forms in code sounded inefficient and tedious. So, someone created a form generator engine configurable with a custom domain-specific language (DSL). It worked pretty well for about 80-90% of the problem, but then the developers had to deal with the many little quirks and exceptions that happen in the world of IDs. For example, a particular ID issued by a regional government only contained certain fields after the 1970s. Certain forms of ID were required for certain purposes, but these rules had to take into account whether those IDs even existed in particular times or places. Each quirk required extra logic to be hacked into the form generator, and pretty soon the project turned into a Zeno tarpit. No single developer understood more than a small part of the DSL (and most of the DSL was only understood by developers who’d left the project) so new forms were generated by copying existing DSL code for a similar form, editing it, and hoping for the best.

The team finally threw out the system and now build forms using normal code with composable modules and no added abstraction layers. They say it’s bliss.

The form generator wasn’t meant to be the 100%-configurable solution to all the world’s problems, but it hit the inner platform effect because there weren’t any clear boundaries to what logic it needed to support. As a counter-example, take regular expressions. Regular expressions have had some feature creep from their origins in formal language theory, but it’s still well accepted that there are limits to what regular expressions should support. If you ask a regex library developer to add new features to help you parse HTML, you’ll be told to use a different tool or face the pony.

APIs

A direct application of this kind of thinking is in API design. Some APIs are thick. They’re logic-heavy abstraction layers that mostly hide whatever’s underneath. Other APIs are thin. They don’t contain much logic, and the interfaces look a lot like the things being interfaced to. Obviously, a thick API is in danger of becoming an inner platform if it has to support essentially all the functionality of what it interfaces to, so thick APIs work best when they have reduced functionality, by design.

Take HTTP client libraries for example. Good ones tend to fall into two camps: those that support a feature like “download something from this URL” using a simple download() function, and those that support practically everything in HTTP using functions like get(), put() and post().

The litmus test when designing an interface like this is to ask, “What functionality will this not support?” If that’s hard to answer, there’d better be a lot of identifiable accidental complexity that the interface is fixing. Otherwise, any complexity in the abstraction layer will just add to the total complexity.

Software Verification

A good way to find bugs in software is to use redundancy. For example, when I try to implement a difficult algorithm, I like to also implement a totally different algorithm that does the same job, and compare the outputs for as much input as I can generate. Both implementations could be buggy, but it would be unlucky if all the bugs were identical. Variable and type declarations are another example of redundancy with bug-finding power.

But that’s not how everyone wants bug-finding to be — some people want a systematic way to verify that software is correct. I once did a very short contract at a branch office of a particular big multinational company. The managers at this office were sold on the idea that specification was the solution to software bugs. If engineers were made to specify every software component before writing any code, then any bugs could be blamed on a lack of discipline in following the spec. (The spec-writing phase of a project often lasted for multiple years.) Of course, the more the spec describes, the more complex it gets, and it can’t fully describe the entire software system without being at least as complex as the software system. So this thinking assumes that spec is inherently less bug-prone than code, but code can be statically checked, compiled, and run to reveal obvious bugs. Natural language spec can’t, and it’s rarely written with the same kind of rigour as code.

The spec documents at this office were full of bugs. There were outright contradictions between different parts of a document, as well as much subtler bugs, such as a component with a list of, say, four requirements, where any subset of three would be feasible, but all four together not.

What if they’d used computer-verifiable formal spec? They’d still have the same dilemma: it’s only effective as direct verification if your formal system is somehow inherently less bug-prone (i.e., has less accidental complexity than the program) or if it only describes a subset of the program functionality (i.e., has less essential complexity than the program). Fonseca, et al, studied three formally verified distributed systems and found 16 bugs. Unsurprisingly, most of the bugs found were in aspects of the systems that weren’t formally verified, but others were found in the formal specs, and in the automated testing frameworks themselves.

I’m not trying to single out specification here. This dilemma applies to any attempt at software verification. Things only go bad when a single tool or technique or practice is treated as the one true (if done “properly”) way to directly ensure software correctness. There’s no such thing.

And Everything Else

That old joke aside, complexity is the one really hard thing in software engineering. In an abstract way, practically everything can be thought of as a bunch of interacting components that each carry some amount of complexity. A lot can be figured out just by treating this complexity as a kind of “stuff” that gets transformed by every interface. This post is long enough already, so I hope the above examples make the case: looking at the complexity and banishing all magical thinking about what happens to it is key for good software design.