Thursday, August 31, 2023

Patching is not Programming

One of the best skills to have in software development is the ability to jump into some code, find an error, and then fix it.

All software needs bug fixing; being able to do it quickly and reliably is a great skill.

But ...

Programming is not bug fixing. They are completely different. You can, of course, just toss in some code from StackOverflow, then bug fix it until it seems to be doing what it needs to do. That is a common strategy. But it is not programming.

In programming, you go the other way. You take the knowledge you have, and you carefully lay out the mechanics to make it real. What is important is that you have laid out the code in an organized manner, with an expectation of all of its possible behaviours.

Then later you bug fix it.

Thursday, August 24, 2023

Cache Invalidation Problem

In an earlier post, when I suggested that it was unethical for computers to lie to their users, someone brought up the “cache invalidation” problem as a counterpoint.

Cache invalidation is believed to be a hard problem to solve. If you cache any type of data, how do you know when to invalidate and remove that data from the cache?

Their argument was that because it is believed to be hard (but not impossible), it is then acceptable for any implementation of caching to send out the wrong data to people. That ‘hard’ is a valid excuse for things not working correctly.

As far as difficult programming problems go, this one is mostly self-inflicted. That is, it is often seen as a hard problem simply because people made it a hard problem. And it certainly isn’t impossible like the Two Generals’ Problem (TGP).

Essentially, if you don’t know how to cache data correctly, then don’t cache data. Simple.

“We have to cache to get performance”

In some cases, yes, but for most of my life, I have been ‘removing’ caching from applications to get performance. Mostly because the caching was implemented so badly that it would never work.

There are two very common problems with application caching:

First: Caching everything, but never emptying the cache.

A cache is a fixed-size piece of memory. You can’t just add stuff without deleting it. A well-written cache will often prime (preload values), and then every miss will cause an add/delete. It requires a victim selection approach such as LRU.
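
As a rough sketch, assuming nothing more than a plain dictionary plus an eviction policy, a bounded LRU cache looks something like this:

    from collections import OrderedDict

    class LRUCache:
        """A fixed-size cache; adding past capacity evicts the least recently used entry."""

        def __init__(self, capacity):
            self.capacity = capacity
            self.entries = OrderedDict()

        def get(self, key):
            if key not in self.entries:
                return None  # a miss; the caller fetches the real value and calls put()
            self.entries.move_to_end(key)  # mark as recently used
            return self.entries[key]

        def put(self, key, value):
            if key in self.entries:
                self.entries.move_to_end(key)
            self.entries[key] = value
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # victim selection: drop the LRU entry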

So, yeah, putting dictionaries everywhere and checking them for previous things is not caching; it is just crude lookup tables, and you have to be extra careful with those. You can use those for memoization in long-running algorithms, but you cannot use them as cheap-and-easy fake caching.
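
Memoization is the legitimate variant of that: a lookup table bound to one specific computation, not a general cache. A minimal sketch using the standard library, so the table stays bounded:

    from functools import lru_cache

    @lru_cache(maxsize=1024)  # bounded, so the table cannot grow forever
    def fib(n):
        # a deliberately expensive recursive computation, worth memoizing
        return n if n < 2 else fib(n - 1) + fib(n - 2)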

If you add stuff and never get rid of it, it isn’t caching, it is just an intentional or accidental memory leak.

Second: Caching doesn’t even work on random or infrequent access.

For caching to save time, you have to have a ‘hit’ on the cache. Misses eat more CPU, not less. Misses are bad. If 90% of the time you get a miss, the cache is wasting more CPU than it is saving. You can increase the size of the cache, but priming it just takes longer.
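
A quick back-of-the-envelope, with made-up costs, shows why the hit rate dominates:

    # Illustrative numbers only, not measurements.
    hit_cost = 1      # a cache lookup that succeeds
    miss_cost = 12    # a failed lookup, plus the real fetch, plus the add/delete
    fetch_cost = 10   # fetching directly, with no cache at all

    for hit_rate in (0.1, 0.5, 0.9):
        expected = hit_rate * hit_cost + (1 - hit_rate) * miss_cost
        print(f"hit rate {hit_rate:.0%}: cached={expected:.1f} vs uncached={fetch_cost}")

    # At a 10% hit rate the cache averages ~10.9 per request, worse than no cache.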

So caching really only works on common data that is used all of the time. You can’t cache everything, and it probably follows the Pareto rule or something, so maybe less than 20% of the data you have in a system would actually benefit from caching.

The core point though is if you don’t know the frequency and distribution of the incoming requests, you are not ready to add in caching. You’re just taking wild guesses and those guesses will mostly be wrong.

“We have to cache to reduce throughput”

This is the other side of the coin, which shows up when scaling. You use caching to reduce the throughput for a lot of clients. In this case, it doesn’t matter if accessing the data is slower, so long as the requests for data are reduced. It is not a problem for small, medium, and even some large systems. It’s a special case.

General caching really only works correctly for static resources. If the resource changes at all, it should not be cached. If the resource could be different, but you don’t know, then you cannot cache it.

If you do what web browsers do, allowing caching but providing a refresh action to override it, you’re mostly just annoying people. People keep hitting shift-refresh just in case. All the time. So the throughput isn’t less, it is just delayed, and can even be more.

“We can’t know if the data we cached has been changed ...”

There are two types of caches: a read-only cache and a write-through cache. They are very different and have different usages.

A read-only cache holds data that never changes. That is easy. However you define it, there is a guarantee out there, somewhere, that the data will not change (at least for a fixed time period that you know). Then you can use a read-only cache; no other effort is needed.

Although it is read-only, it is also good practice to add a cache purge operational endpoint anyway. Guarantees in the modern world are not honored as they should be, and someone may need to ask ops to purge the cache in an emergency.
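
A sketch of that shape, assuming a hypothetical fetch function for the underlying source, and leaving victim selection out for brevity:

    import time

    class ReadOnlyCache:
        """Holds data covered by a known no-change guarantee, with an ops purge hook."""

        def __init__(self, fetch, guarantee_seconds):
            self.fetch = fetch            # e.g. a database or service call
            self.ttl = guarantee_seconds  # the fixed period the guarantee covers
            self.entries = {}             # key -> (value, loaded_at); eviction omitted

        def get(self, key):
            entry = self.entries.get(key)
            if entry and time.monotonic() - entry[1] < self.ttl:
                return entry[0]
            value = self.fetch(key)
            self.entries[key] = (value, time.monotonic())
            return value

        def purge(self):
            # wire this to an operational endpoint for emergencies
            self.entries.clear()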

If you do have data that may occasionally change after it has been cached, there are a couple of ways of handling it.

The worst one is to add timeouts. It sort of works for something like static HTML pages or CSS files, but only if your releases are very infrequent. You get a bit of resource savings at the cost of short periods of known buggy behavior. But it is a direct trade-off: for up to the entire length of the timeout remaining after a change, the system is possibly defective.

It may also be important to stagger the timeouts if what you are trying to do is ease throughput. Having everything time out in the middle of the night (when you expect data refreshes) doesn’t help with an early morning access spike. If most people use the software when they first get into work, that approach is mostly ineffective.
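
One simple way to stagger them, assuming a base TTL you have already chosen, is to add random jitter per entry so the expiries spread out instead of landing together:

    import random

    BASE_TTL = 3600  # one hour, purely as an example

    def jittered_ttl():
        # spread expiries over +/-10% so they don't all land at the same moment
        return BASE_TTL * random.uniform(0.9, 1.1)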

It makes no sense at all, though, if most of the caching attempts miss because the entries have timed out earlier. If you have a giant webapp with a short timeout, and most traversals through the site are effectively random walks, then the hits on the overlap are drowned out by timing and space constraints. If you bypass the memory constraints by utilizing the disk, you will save a bit on throughput, but the overall performance suffers. And you should only do this with static pages.

There were days in the early web when people blindly believed that caching as much as possible was the key to scaling. That made quite a mess.

The second worst way of handling changing data is to route around the edge. That is, set up a read-only cache, have a secondary way of updating the data, then use events to trigger a purge on that entry, or worse, the entire cache. That is weak, in that you are leaving a gaping hole (aka a race condition) between the update time and the purge time. People do this far too often, and for the most part the bugs are not noticed or reported, but that doesn’t make it reasonable.

The best way of dealing with changeable data is having a ‘write-through’ cache that is an intermediary between all of the things that need the cached data and the thing that can update it. That is, you push down a write-through cache low enough so everybody uses it, and the ones that need changes do their updates directly on the cache. The cache receives the update and then ensures that the database is updated too. It is all wired with proper and strict transactional integrity, so it is reliable.
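
A minimal sketch of that arrangement, assuming a hypothetical db handle with read, write, and transaction operations:

    class WriteThroughCache:
        """The single intermediary: both reads and updates go through here."""

        def __init__(self, db):
            self.db = db       # the only thing allowed to touch the database
            self.entries = {}  # victim selection omitted to keep the sketch short

        def get(self, key):
            if key not in self.entries:
                self.entries[key] = self.db.read(key)
            return self.entries[key]

        def put(self, key, value):
            # commit to the database first, inside a transaction, then update the
            # cache, so the cache never holds a value the database did not accept
            with self.db.transaction():
                self.db.write(key, value)
            self.entries[key] = value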

“We can’t use a write-through cache, it is a different team that does the updates”

That is an organizational or architectural problem, not a caching one. There is a better way to solve this, but the way you are organized, half of the problem falls on the wrong people.

At the very worst, set up an update event stream, purge individual items, and live with the race condition. Also, provide an operational endpoint to purge everything, as you’ll need it sometimes.
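
A sketch of that fallback, assuming a hypothetical update event carrying the changed key and the cache shape from the earlier sketches:

    def on_update_event(event, cache):
        # purge only the changed entry; the window between the write landing and
        # this handler firing is the race condition you are choosing to live with
        cache.entries.pop(event.key, None)

    def purge_all(cache):
        # the operational endpoint: dump everything in an emergency
        cache.entries.clear()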

If it can change and you won’t ever know when it changes, then you should not cache it.

“We have 18 copies of the data, and 3 of them are actively updated”

That is just insane. What you have is a spaghetti architecture with massive redundancy and an unsolvable merge problem. About the only thing you should do here is to not cache. Fix your overall architecture, clean it up, and do not try to bypass ‘being organized’ with invalid hacks.

It’s worth pointing out again that if you cannot cache correctly, you should not cache. Fixing performance should not come at the cost of destroying integrity. Those are not real performance fixes, just dangerous code that will cause grief.

To summarize, there isn’t really a cache invalidation problem. You are not forced to lie. Yes, it can be hard to select a victim in a cache to purge. But it is hard because the implementations are bad, or the organization is chaotic, or the architecture is nonexistent. It is hard because people made it hard; it is not a hard problem in general. You don’t have to cache, and if you do, then do it properly. If you can’t do it properly, minimize the invalid periods. Make sure that any improvements are real and do not come at the cost of throwing around stale, incorrect data.

Thursday, August 17, 2023

Development Workstations

Programmers write code.

They need complicated tools to do this properly.

They should have fast machines with lots of resources. Anyone who thinks saddling their programmers with underpowered machines is a good idea is mistaken. You’re just wasting your own money if the coders can’t do their jobs effectively. Money spent on hardware for coders is always well spent.

Development machines need access to a lot of tools. Programmers need these tools to be able to do their jobs efficiently. By inefficient, I don’t mean it will take 20% longer; I mean it will be a multiplier like 3x or worse. If programmers can download the tools they need, as they need them, they can flow through the work and get the job done. Delays and disruptions are crazy expensive.

There is no absolute way to predict all of the tools. The software industry is too vast and the range of issues is too volatile. You can’t just make a list of the tools. It doesn’t work that way, it has never worked that way, and any time someone has tried to enforce such a restrictive viewpoint, it has hurt development efforts, crushed morale, and resulted in programmers jumping ship. When that happens, it can take a very long time for a company to recover, as its reputation amongst the programming community tends to take a rather massive hit.

There are two ways to deal with programmers bringing in outside tools. Both are probably necessary.

The first is to segregate them on their own network. In that way, if one of their tools is problematic, the scope of the damage is controlled.

The other is to set up the network so that only some of the programmers can bring in outside tools. You make it a senior responsibility, but still one that is handled by the programmers. A nonprogrammer may reject some tools simply because they do not understand them, so it needs to be someone with the experience and knowledge to make an informed decision.

As the toolset grows, the tools should be kept internally somewhere. That is, when programmers need a new tool or a new version of one, they should go to the internal tool repository and get it from there. It should be easy and convenient to get tools from there, and after a while, most of the major tools will be there. There should also be some documentation listing the ‘best’ version of any tool that the coders should use.

If you do all of those things correctly, then the developers can have minimal constraints on how they use their machines. That is, they have admin rights, and any preventive software is minimized.

A clear sign that an organization does not understand “programming” is when they insist that programmers aren’t administrators of their own machines. Confusing programmers with power users is a common, but really bad, mistake. Programmers should program. Programming involves messing with stuff. Messing with stuff can be dangerous. If you stop them from messing with stuff, you cripple their ability to program, and their output tends to degrade to unusable quality. That is, either you empower people to do their jobs, or you accept that their jobs will be done poorly. You can’t have it both ways.

Fully disempowered programmers are just typists. You don’t need typists, you need programmers to build you stuff that is good enough to solve your problems.

A poor worker may blame their tools, but a gifted worker without reasonable tools is a waste.

Thursday, August 10, 2023

Complexity and Principles

In the midst of overwhelming complexity, the best option is to pull up slightly.

As we are human, things can get far too messy and complicated, so even if we are keeping up with it, the sheer size of the scope will cause us to be rampantly inconsistent. Those inconsistencies are themselves extra complexity, added as a byproduct of our not being able to cope with the intrinsic complexity. So it keeps getting worse.

While the ‘devil’ is in the details, trying to deal with the details at their lowest level is a losing game. So, instead, we want to go a little higher and look for patterns that cut across similar but different cases.

Instead of getting caught up in the abstract nature of patterns though, we can think of these overarching structural constraints as ‘principles’. Then the higher-level principles should constrain the lower-level details.

Then you might have a great sea of details, but you have a fairly small set of principles above that keeps the details in place. Then it becomes easy to fix things. If some specific special case violates a higher principle, you adjust it accordingly. Then, to be consistent, you don’t have to remember all of the different cases, just the binding principles. With both, you have a way of moving forward and improving the work without having to exceed your cognitive limits.

If at some point the principles themselves become too much, then we just do the same thing again and go up another level to meta-principles, to ensure that our principles are aligned and consistent.

Now this is not the same as trying to find top-down abstractions.

This is a bottom-up approach, where you start with the details first. When you see enough of them, you try to align their similarities into principles. The key part though is that you recognize and accept that the different groups of details should share common patterns, and so it is worthwhile looking for those patterns.

But it is worth noting that if the principles are not obvious, then creatively overlaying fake patterns on top is not good. If there are no obvious patterns, then there are no principles that bind things together. We don’t want principles for the sake of having principles, as that would just be extra complexity; we want them for the sake of simplifying stuff and making it consistent, to allow us to reduce complexity. So insanely complicated principles are not actually principles, they are just artificial meta-details that should themselves be dropped.

In that way, along with ‘encapsulation’, we can tackle overwhelming complexity without getting completely overwhelmed. You cannot fix complicated things by ignoring their nature with over-simplified approaches; that will usually make it worse. If something is complex, you first have to accept that it is complex. You have to deal with it as a complex entity and work within those boundaries.

Thursday, August 3, 2023

Usability Tradeoff

Software graphical user interfaces are funny things.

For anything you’d like to do, there is a whole spectrum of possible interfaces. On each screen, you can have various types of modes. You can overlay all sorts of navigational paradigms on top. There are plenty of different widgets and lots of different styling tools. When you add up all of the permutations, the possibilities feel endless.

But they are not.

For one thing, user interfaces are highly subject to current trends. A new interface either fits in nicely with the ones around it or it clashes. Its alignment may be at the higher levels, often referred to as ‘look’ and ‘feel’, or at the lower levels, in how it uses the underlying widgets.

So, one rather counter-intuitive property of interfaces is that if they are easy to program, then they are hard to use. It’s not a strict tradeoff, but it is pretty close to it.

If you hire a graphic designer, they will make your interface look great.

If you don’t, it is incredibly difficult to make it look great; most people aren’t naturally gifted at that, it takes experience. Ugly interfaces increase the cognitive load on the user; they basically have to expend energy to ignore how annoying the thing is. It doesn’t ‘look’ great.

The ‘feel’ part is similar, usually the domain of a UX expert. Really it is how the navigation maps back to the user’s workflow. Obviously, poor or awkward mappings make it harder for the user to get around and do what they need to do. Instead of it being intuitive, they have to constantly remember that for ‘this interface’ they do some weird steps. So again it is an extra cognitive load for the users.

Interfaces with poor ‘feel’ also usually end up needing a lot of training. The users can’t just sit down and do what they need to do; instead, it is always a puzzle of some sort. That makes the project longer too.

For any large interface, a lot of work goes into initially setting up the UX, but as more and more functionality gets added, there is also a lot of work in keeping it relevant.

Then we get down to widgets.

Pretty much every platform provides the same basic ones, and some platforms provide even more. They all have an appropriate use. For example, if there is a password widget, then using it instead of a plain text widget is preferable.

And there are collected paradigms like dynamic trees. Lots of frameworks have some tree support, but they usually only work for small static trees; huge dynamic trees are a lot of work to implement correctly. Retrofitting paging back into trees and lists can be painful.

Layouts with widgets are often tricky as well. You can use simple ones, but they don’t look appealing. A good layout that matches a strong graphic design is often a complicated nest of different containers and layout managers, sometimes 3 or even 4 levels deep.

Now it is far easier for a programmer to ignore the graphic design and UX requirements and toss the widgets around instead of using them properly. As well, it is extraordinarily difficult to sort out the UX mapping issues, as doing so always requires some foresight into how the application will grow.

So if you were going to produce a really slick interface, it is actually a huge amount of work. If you skimp on that work, it’s not that hard to wire up some widgets that do some stuff, but it will never make the users happy. So, it is a pretty direct tradeoff, just a little different because, for graphic design and UX, you would hire outside help, so it’s not harder in those cases for the programmers, it is just more expensive for the project.