Exactly where you put caching in a distributed system has a significant impact on its effectiveness, in ways that aren’t always obvious during the design phase of development.

Three Architectures

I’ll split caching architectures into three broad types:

Three distributed system caching architectures: user or frontend requests come to the server from the left. A cache can sit between the client and the server, embedded inside the server, or alongside the server as a standalone service.

An intercepting cache acts as a gate for requests between the server and client. If the cache decides the request is a “hit”, the server never sees the request, and the user gets served a recycled response from the cache. All web services have to deal with this kind of caching – HTTP clients cache even if the server doesn’t. A cache between the application and a backend like a database is the same concept.
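The key property is that a hit short-circuits the backend entirely. Here's a minimal sketch of that idea (class and handler names are illustrative, not from any particular library):

```python
# Minimal sketch of an intercepting cache: it sits in front of the real
# handler, and on a hit the backend handler is never invoked at all.

class InterceptingCache:
    def __init__(self, handler):
        self.handler = handler      # the "real" server behind the cache
        self.store = {}             # cached responses keyed by request

    def get(self, request):
        if request in self.store:   # hit: the server never sees the request
            return self.store[request]
        response = self.handler(request)   # miss: forward to the server
        self.store[request] = response
        return response

calls = []
def server(request):
    calls.append(request)
    return f"page for {request}"

cache = InterceptingCache(server)
cache.get("/products/1")   # miss: reaches the server
cache.get("/products/1")   # hit: served from the cache, server untouched
```

After the second call, `calls` still contains only one entry – the hit imposed zero load on the server.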

Embedded caches tend to be ad-hoc and application-specific. Many popular applications and frameworks have them because they’re such an easy way to get a quick performance boost – an embedded cache is often as simple as a hash table inside the code.
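In Python, the hash-table-inside-the-code version can be as little as a decorator (the function below is a made-up stand-in for an expensive lookup):

```python
import functools

# An embedded cache in its simplest form: a memoised function. The
# cache is just a hash table living inside the application process.

@functools.lru_cache(maxsize=1024)
def product_details(product_id):
    # imagine an expensive database query or computation here
    return {"id": product_id, "name": f"product {product_id}"}

product_details(7)   # miss: does the "expensive" work
product_details(7)   # hit: served from the in-process hash table
```

`product_details.cache_info()` reports the hit after the second call. Note that the cache lives and dies with the process – a detail that matters later.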

The standalone cache is a separate, external server – effectively a new backend. Logically it works much like an embedded cache, but a cache lookup takes a remote procedure call (RPC) instead of a local function call inside the code. memcached and Redis are well-known examples.
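The usual access pattern is cache-aside: check the cache server, fall back to the database on a miss, then populate the cache. A sketch, using a dict-backed `RemoteCache` as a self-contained stand-in for a real memcached or Redis client:

```python
# Cache-aside lookup against a standalone cache server. In a real
# system, RemoteCache would be a memcached or Redis client and every
# get/set would be an RPC over the network.

class RemoteCache:
    def __init__(self):
        self._data = {}
    def get(self, key):            # in reality: a network round trip
        return self._data.get(key)
    def set(self, key, value):
        self._data[key] = value

def load_user(cache, db, user_id):
    key = f"user:{user_id}"
    value = cache.get(key)         # remote call, not a local lookup
    if value is None:
        value = db[user_id]        # miss: fall back to the database
        cache.set(key, value)      # populate for the next caller
    return value

db = {42: "Alice"}
cache = RemoteCache()
load_user(cache, db, 42)    # miss: fills the cache
load_user(cache, {}, 42)    # hit: the (empty) db is never consulted
```

The second call succeeds even with an empty database, because the standalone cache is shared state separate from any one application server.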

Intercepting Caches: Easy Come, Easy Go

A great thing about intercepting caches is that they can often be bolted onto an existing system without any refactoring. You stick one into an existing communication link, and average response time goes down. Unfortunately, it doesn’t always go down by much if the server wasn’t designed with caching in mind.

Typically, as the application develops, requests get more complex and the rate of cache hits goes down. For example, a simple shopping website might be highly cacheable at first launch when product pages are just product pages. But when personalised greetings and product recommendations are added, a page generated for Alice can no longer be cached and served to Bob.
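The mechanism behind that drop is easy to see: once a page depends on who's asking, the cache key has to include the user, so entries can no longer be shared. A toy illustration (names are invented for the example):

```python
# Once the response is personalised, the user becomes part of the
# cache key, and a page rendered for Alice can't be served to Bob.

cache = {}

def render(user, product):
    return f"Hello {user}! Here is {product}."

def cached_page(user, product):
    key = (user, product)       # personalisation forces user into the key
    if key not in cache:
        cache[key] = render(user, product)
    return cache[key]

cached_page("Alice", "teapot")
cached_page("Bob", "teapot")    # same product, but a separate cache entry
```

With U users, the same product page now occupies up to U cache entries, and Bob's first visit is always a miss no matter how many times Alice has viewed the page.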

A lot of tuning and refactoring can be done to push that cache hit rate back up, and sometimes it pays good dividends. But ultimately, complex applications are hard to cache well this way.

Embedded Caches: A Technical Loan

A cache inside the application has a lot more opportunity to cache things because it doesn’t have to be all or nothing: even if all requests are subtly different, the embedded cache might still be able to exploit similarities between requests. This makes them very popular for boosting performance.

But consider what happens if the application needs to scale to meet higher traffic. Now we have multiple application servers sitting behind a load balancer:

Application servers with embedded caches sitting behind a load balancer

In an ideal world, N servers would offer N times the capacity of one. In the real world, it's normally somewhat less than N times, but hopefully still O(N). In this case, however, things are much worse. Suppose a request comes in and gets processed by an application server. That server warms up its cache for that kind of request, but sadly there's only a 1/N chance that the load balancer will send the next similar request to the same server. In some sense, caching performance goes down as servers are added.
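A toy simulation makes the effect concrete (illustrative only, not a benchmark): requests drawn from a fixed set of request kinds are spread at random across N servers, each with its own embedded cache.

```python
import random

# Toy model: requests of n_kinds distinct kinds arrive and the load
# balancer routes each one to a random server. Each server caches only
# the kinds it has personally seen, so more servers mean colder caches.

def hit_rate(n_servers, n_kinds=50, n_requests=2000, seed=0):
    rng = random.Random(seed)
    caches = [set() for _ in range(n_servers)]
    hits = 0
    for _ in range(n_requests):
        kind = rng.randrange(n_kinds)
        server = rng.randrange(n_servers)   # load balancer picks at random
        if kind in caches[server]:
            hits += 1
        else:
            caches[server].add(kind)        # warms this server's cache only
    return hits / n_requests
```

For the same workload, `hit_rate(1)` comfortably exceeds `hit_rate(8)`: with one server every kind is cached after a single miss, while with eight servers each kind must miss on (up to) all eight of them.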

The obvious “fix” is something like session stickiness – configuring the load balancer so that each user's requests keep going to the same backend server. However, session stickiness is a pretty bad solution to performance problems. Load balancing is one of those things that sounds very simple, but turns out to be hard in the real world. Not “hard” as in “you must be super smart”, but as in “you must accept suboptimal performance in practice”. Session stickiness just makes things much, much harder, and hence even less efficient.

If that’s not bad enough, there’s worse. So far I’ve been ignoring things like cache coherency, but let’s look at it now. With each server containing its own little cache, every server will probably end up talking to every other one to make sure it has a consistent view of the world. What that means is that for N servers, the number of communication channels to maintain the caches is O(N^2), so performance scaling in this regard is 1/O(N^2). Needless to say, this is pretty bad.
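The quadratic growth follows directly from counting pairs – with all-to-all coherency, every pair of caches needs a channel:

```python
# All-to-all coherency traffic: every pair of the N embedded caches
# needs a communication channel, so the count grows quadratically.

def coherency_channels(n_servers):
    return n_servers * (n_servers - 1) // 2   # one channel per pair

# coherency_channels(4) == 6; coherency_channels(8) == 28
```

Doubling the fleet from 4 to 8 servers more than quadruples the coherency channels (6 to 28), which is exactly the O(N^2) cost described above.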

Application servers with embedded caches doing all-to-all communication for cache coherency

In short, embedded caches offer easy short-term gains, but charge heavy interest if you need to scale up.

External Caches: With Great Power Comes Great Complexity

Strictly speaking, cache coherency doesn’t have to be O(N^2), but any better solution is going to be more complex than just taking the caches out of the application servers. The major benefit is that you simply don’t need as many cache servers as application servers, so the above problems become more manageable or disappear. This might sound like cheating, but it’s effective. Caches are much faster than your application (otherwise there’d be no point caching) so, depending on your application, you might have a few cache servers supporting over a dozen application servers.

Another practical advantage is that generic, off-the-shelf cache servers are better engineered than typical ad-hoc hash-table-that-grew-legs embedded cache implementations. Look for features like persistence, replication and monitoring stats.

Standalone caches are definitely better than embedded caches for high-traffic systems, but they do have some major costs. Running multiple servers just makes everything more complex – you need to figure out service discovery, health checking and load balancing, as well as how to create dev and testing environments. If you’re running a high-traffic site, you need to do this stuff anyway, but it’s pain for no gain for small sites where a single, unified package is enough.

It’s not just the infrastructure that gets more complicated, either. When your application is one package, it can be deployed and updated as one package. In particular, you can atomically upgrade an application server with its cache (albeit with a cold cache on startup). If you have external cache servers shared by multiple application servers, it’s totally plausible to have data stored in the cache by version N of the application get used by another instance at version N+1 (and vice versa). So your application and its cache data need to have some kind of forward and backward compatibility for at least a single version. Again, this is the kind of thing you’re forced to deal with anyway at large scale, but there’s no point adding this kind of complexity at small scale.
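One common way to get that compatibility is to tag each cached entry with a schema version, and have readers recompute rather than guess when they see an entry they can't handle. A sketch under those assumptions (the scheme and field names are illustrative):

```python
import json

# Sketch of mixed application versions sharing one cache: entries carry
# a schema version; a reader accepts entries at or below its own version
# and falls back to recomputing anything written by a newer version.

SCHEMA_VERSION = 2

def encode(entry):
    return json.dumps({"v": SCHEMA_VERSION, "data": entry})

def decode(raw, recompute):
    obj = json.loads(raw)
    if obj["v"] > SCHEMA_VERSION:   # written by a newer app version
        return recompute()          # don't guess: rebuild the entry
    return obj["data"]              # equal/older versions are readable

old = json.dumps({"v": 1, "data": {"name": "Alice"}})
new = json.dumps({"v": 3, "data": {"name": "Alice", "tier": "gold"}})
decode(old, lambda: {})                  # backward compatible: read as-is
decode(new, lambda: {"name": "Alice"})   # too new: recomputed instead
```

This buys forward compatibility (a version-2 reader survives version-3 data) at the cost of extra misses during a rollout – the kind of trade-off the shared cache forces on you.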

Both standalone and embedded caches have a serious disadvantage compared to intercepting caches: they don’t bring server load to zero on a cache hit. Even for a site where an intercepting cache has a lower hit rate during normal traffic, the intercepting cache can save the day during a major traffic spike, because spikes are commonly focussed on a small number of cacheable hot requests.


All three caching architectures have major pros and cons, and they all have a place in the distributed systems toolbox.