Thursday, August 24, 2023

Cache Invalidation Problem

In an earlier post, when I suggested that it was unethical for computers to lie to their users, someone brought up the “cache invalidation” problem as a counterpoint.

Cache invalidation is believed to be a hard problem to solve. If you cache any type of data, how do you know when to invalidate and remove that data from the cache?

Their argument was that because the problem is believed to be hard (though not impossible), it is acceptable for any implementation of caching to send out wrong data to people. That ‘hard’ is a valid excuse for things not working correctly.

As far as difficult programming problems go, this one is mostly self-inflicted. That is, it is often seen as a hard problem simply because people made it a hard problem. And it certainly isn’t impossible in the way the Two Generals’ Problem is.

Essentially, if you don’t know how to cache data correctly, then don’t cache data. Simple.

“We have to cache to get performance”

In some cases yes, but for most of my life I have been ‘removing’ caching from applications to get performance, mostly because the caching was implemented so badly that it could never work.

There are two very common problems with application caching:

First: Caching everything, but never emptying the cache.

A cache is a fixed-size piece of memory. You can’t just keep adding entries without deleting any. A well-written cache is often primed (preloaded with values), and once it is full, every miss forces a paired add and delete. That requires a victim-selection policy such as LRU (least recently used).
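
To make that concrete, here is a minimal fixed-capacity LRU cache in Python; the names and the capacity handling are my own for the sketch, not a prescribed implementation:

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-capacity cache: once full, every insert evicts an LRU victim."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()  # key -> value, ordered by recency

    def get(self, key):
        if key not in self._data:
            return None  # miss: the caller must fetch from the real source
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least-recently-used victim
```

Priming is then just a matter of calling put() for the known-hot keys at startup.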

So, yeah, scattering dictionaries everywhere and checking them for previously seen values is not caching; those are just crude lookup tables, and you have to be extra careful with them. They are fine as memoization for long-running algorithms, but you cannot use them as cheap-and-easy fake caching.
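
For the legitimate memoization case, Python’s standard library already ships a bounded version, so there is no excuse for a bare, ever-growing dictionary; the function here is just a stand-in:

```python
from functools import lru_cache

@lru_cache(maxsize=1024)  # bounded: old entries get evicted, unlike a bare dict
def expensive(n: int) -> int:
    # stand-in for a long-running, pure computation
    return sum(i * i for i in range(n))
```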

If you add entries and never get rid of them, it isn’t caching, it is just an intentional (or accidental) memory leak.

Second: Caching doesn’t even work on random or infrequent access.

For caching to save time, you have to get ‘hits’ on the cache. Misses eat more CPU, not less; misses are bad. If 90% of the time you get a miss, the cache is wasting more CPU than it is saving. You can increase the size of the cache, but then priming it just takes longer.
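
A rough back-of-the-envelope calculation shows why; all of the costs below are made-up numbers for illustration:

```python
# Assumed costs in microseconds -- illustrative only.
t_lookup = 1.0   # checking the cache, paid on every request
t_fetch  = 5.0   # fetching from the real source, paid on every miss
t_insert = 2.0   # adding the value (and evicting a victim), paid on every miss

def expected_cost(hit_rate: float) -> float:
    return t_lookup + (1.0 - hit_rate) * (t_fetch + t_insert)

print(t_fetch)               # 5.0 -- the cost of having no cache at all
print(expected_cost(0.10))   # 7.3 -- a 90% miss rate burns more CPU than it saves
print(expected_cost(0.90))   # 1.7 -- a high hit rate is where caching pays off
```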

So caching really only works on common data that is used all of the time. You can’t cache everything, and it probably follows the Pareto rule or something, so maybe less than 20% of the data you have in a system would actually benefit from caching.

The core point though is if you don’t know the frequency and distribution of the incoming requests, you are not ready to add in caching. You’re just taking wild guesses and those guesses will mostly be wrong.

“We have to cache to reduce throughput”

This is the other side of the coin, which shows up when scaling. You use caching to reduce the volume of requests that reach the underlying system when there are a lot of clients. In this case, it doesn’t matter if accessing the data is slower, so long as the number of requests for the data is reduced. Load is not a problem for small, medium, and even some large systems; it’s a special case.

General caching really only works correctly for static resources. If the resource changes at all, it should not be cached. If the resource could have changed, but you don’t know, then you cannot cache it.

If you do what web browsers do, allowing caching but providing a refresh action to override it, you’re mostly just annoying people. People keep hitting shift-refresh just in case, all the time. So the request volume isn’t lower, it is just delayed, and it can even end up higher.

“We can’t know if the data we cached has been changed ...”

There are two types of caches: a read-only cache and a write-through cache. They are very different and have different usages.

A read-only cache holds data that never changes. That is easy. However you define it, there is a guarantee out there, somewhere, that the data will not change (at least within a fixed time period that you know). Then you can use a read-only cache; no other effort is needed.

Although it is read-only, it is still a good practice to add an operational cache-purge endpoint anyway. Guarantees in the modern world are not honored as they should be, and someone may need to ask ops to purge the cache in an emergency.
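
A minimal sketch of that shape, where the loader and the priming step are assumptions for the example:

```python
class ReadOnlyCache:
    """Cache for data guaranteed not to change, plus an emergency purge."""

    def __init__(self, load_all):
        self._load_all = load_all      # returns the full, bounded data set
        self._data = dict(load_all())  # primed once, up front

    def get(self, key):
        return self._data.get(key)     # read-only after priming

    def purge_and_reload(self):
        # the operational escape hatch: ops can force a rebuild in an emergency
        self._data = dict(self._load_all())
```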

If you do have data that may occasionally change after it has been cached, there are a couple of ways of handling it.

The worst one is to add timeouts. It sort of works for something like static HTML pages or CSS files, but only if your releases are very infrequent. You get a bit of resource savings at the cost of short periods of known buggy behavior. But it is a direct trade-off: after a change, for up to the entire remaining length of the timeout, the system is possibly defective.

It may also be important to stagger the timeouts if what you are trying to do is ease the load. Having everything time out in the middle of the night (when you expect data refreshes) doesn’t help with an early-morning access spike. If most people use the software when they first get into work, that approach is mostly ineffective.
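
One way to stagger them is to add a random jitter to each entry’s expiry; the ttl and jitter values here are arbitrary assumptions:

```python
import random
import time

class JitteredTTLCache:
    """Entries expire at base_ttl plus a random offset, so they don't all expire together."""

    def __init__(self, base_ttl: float = 3600.0, jitter: float = 900.0):
        self.base_ttl = base_ttl
        self.jitter = jitter
        self._data = {}  # key -> (value, expires_at); size bounding elided for brevity

    def put(self, key, value):
        expires_at = time.monotonic() + self.base_ttl + random.uniform(0.0, self.jitter)
        self._data[key] = (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]  # expired: treat it as a miss
            return None
        return value
```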

It makes no sense at all, though, if most cache lookups miss because the entries have already timed out. If you have a giant webapp with a short timeout, and most traversals through the site are effectively random walks, then the hits on the overlapping pages are drowned out by timing and space constraints. If you bypass the memory constraints by spilling to disk, you will save a bit on backend load, but overall performance suffers. And you should only do this with static pages.

There were days in the early web when people blindly believed that caching as much as possible was the key to scaling. That made quite a mess.

The second-worst way of handling changing data is to route around the edge. That is, set up a read-only cache, have a secondary way of updating the data, then use events to trigger a purge of that entry or, worse, of the entire cache. That is weak, in that you are leaving a gaping hole (a race condition) between the update time and the purge time. People do this far too often, and for the most part the bugs are not noticed or reported, but that doesn’t make it reasonable.
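
To make the hole concrete, here is a toy version of that arrangement; the delayed timer stands in for the event stream:

```python
import threading
import time

db = {"greeting": "old"}      # stand-in for the source of truth
cache = {"greeting": "old"}   # stand-in read-only cache, already primed

def update_record(key, value):
    db[key] = value  # 1. the source of truth changes immediately...
    # 2. ...but the purge rides an event stream and arrives later:
    threading.Timer(0.5, lambda: cache.pop(key, None)).start()

update_record("greeting", "new")
print(cache.get("greeting"))  # "old" -- a stale read inside the race window
time.sleep(1.0)
print(cache.get("greeting"))  # None -- finally purged; the next read refills
```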

The best way of dealing with changeable data is a ‘write-through’ cache that sits as an intermediary between all of the things that need the cached data and the thing that can update it. That is, you push the write-through cache down low enough that everybody uses it, and the ones that need changes do their updates directly on the cache. The cache receives the update and then ensures that the database is updated too. It is all wired with proper and strict transactional integrity, so it is reliable.
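
A minimal single-process sketch of that shape; DictStore stands in for the real database, and the transactional wiring is elided:

```python
class DictStore:
    """Stand-in for the real database in this sketch."""
    def __init__(self):
        self._d = {}
    def read(self, key):
        return self._d.get(key)
    def write(self, key, value):
        self._d[key] = value

class WriteThroughCache:
    """All reads and writes go through here; the store is never bypassed."""

    def __init__(self, store):
        self._store = store  # the single authority behind the cache
        self._data = {}

    def get(self, key):
        if key not in self._data:
            self._data[key] = self._store.read(key)  # read through on a miss
        return self._data[key]

    def put(self, key, value):
        self._store.write(key, value)  # commit to the store first...
        self._data[key] = value        # ...then update the cache, never the reverse
```

Writing to the store first means that if the write fails and raises, the cache is left untouched and stays consistent with the store.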

“We can’t use a write-through cache, it is a different team that does the updates”

That is an organizational or architectural problem, not a caching one. There is a better way to solve this, but the way you are organized, half of the problem falls on the wrong people.

At the very worst, set up an update event stream, purge individual items, and live with the race condition. Also, provide an operational endpoint to purge everything, as you’ll need it sometimes.
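
For the operational endpoint, even something this small does the job; Flask and the route name are my choices for the sketch:

```python
from flask import Flask

app = Flask(__name__)
cache = {}  # stand-in for whatever cache the service actually holds

@app.route("/ops/cache/purge", methods=["POST"])
def purge_cache():
    # the emergency escape hatch for ops: empty everything, refill on demand
    cache.clear()
    return {"purged": True}
```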

If it can change and you won’t ever know when it changes, then you should not cache it.

“We have 18 copies of the data, and 3 of them are actively updated”

That is just insane. What you have is a spaghetti architecture with massive redundancy and an unsolvable merge problem. About the only thing you should do here is to not cache. Fix your overall architecture, clean it up, and do not try to bypass ‘being organized’ with invalid hacks.

It’s worth pointing out again that if you cannot cache correctly, you should not cache. Fixing performance should not come at the cost of destroying integrity. Those are not real performance fixes, just dangerous code that will cause grief.

To summarize, there isn’t really a cache invalidation problem. You are not forced to lie. Yes, it can be hard to select a victim in a cache to purge. But it is hard because the implementations are bad, or the organization is chaotic, or the architecture is nonexistent. It is hard because people made it hard; it is not a hard problem in general. You don’t have to cache, and if you do, then do it properly. If you can’t do it properly, minimize the invalid periods. Make sure that any improvements are real and that they do not come at the cost of throwing around stale, incorrect data.
