Saturday, May 8, 2021

Redundancies

If a computer program is useful, at some point it will get complicated.

That complication comes from its intrinsic functionality, but more often it is amplified by redundancies. 


If you want to delay getting lost in complexity for as long as possible, then you need to learn how to eliminate as many redundancies as you can. They occur in the code, but they occur in the data as well.


The big trick in reducing redundancies is to accept that ‘abstractions’ are a good thing. That’s a hard pill for most programmers to swallow, which is ironic given that all code, and the tech stacks built upon code, are rife with abstractions. They are everywhere.


The first thing to do is to understand what is actually redundant. 


Rather obviously, if there are two functions in the code with different names but identical bodies, they are redundant. You can drop one and point all of its usages to the other.
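
A minimal sketch of this in Python (the names here are made up for illustration):

    # Two functions with different names but identical bodies.
    def full_name(first, last):
        return first + " " + last

    def display_name(first, last):
        return first + " " + last

    # The fix: keep one, and repoint or alias the other's callers.
    display_name = full_name   # or update every call site to use full_name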


If there are two functions whose inputs are “similar” but whose outputs are always the same, that is redundant as well. If most of the inputs are the same, and the outputs are always the same, then at least one of the inputs is not actually used by the code and can be removed. If one of the functions is a merge of two other functions, you can separate it, then drop the redundant part.
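
A small hypothetical sketch of the unused-input case; the ‘locale’ argument below is accepted but never used, so both functions always produce the same output for the same amount:

    def format_price(amount, locale):
        return "$%.2f" % amount    # 'locale' never affects the result

    def format_cost(amount):
        return "$%.2f" % amount

    # Drop the unused 'locale' input, and the two functions become
    # identical, so one of them can be removed outright.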


The more common problem is that the sum of the information contained in the inputs is the same, yet the inputs themselves are different. So, there are at least two different ways in the code to decompose the same information into a set of variables. It’s that decomposition that is redundant and needs to be corrected first. The base rule here is to always decompose any information into the smallest possible variables, as early as possible. That is offset by utilizing whatever underlying language capabilities exist to bind the variables together as they move throughout the code, for example wrapping a bunch of related variables in an object or a structure. If you fix the redundant variable decompositions, then it’s obvious that the affected functions are redundant and some can be removed.
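
As a sketch in Python, you might parse the raw information once and bind the pieces into a small structure; the field names and input format here are hypothetical:

    from dataclasses import dataclass

    @dataclass
    class Address:
        street: str
        city: str
        country: str

    def parse_address(raw: str) -> Address:
        # One decomposition, done early; assumes exactly two commas.
        street, city, country = raw.split(",")
        return Address(street.strip(), city.strip(), country.strip())

    # Downstream code passes Address around as a unit, instead of
    # re-splitting the raw string in several slightly different ways.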


As you are looking at the control flow in the code you often see the same patterns repeating: the code goes through a couple of ‘for’ loops, hits an ‘if’, then another ‘for’ loop. If all of the variables used by this control flow pattern are the exact same data type (with the same constraints), then the control flow itself is redundant, even if the different flows pass through very different combinations of functions.
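
A hypothetical sketch of capturing that shared flow once, so the variations become parameters instead of copies:

    # The same loop / if / loop shape, written exactly once.
    def apply_pattern(groups, keep, transform):
        results = []
        for group in groups:           # first 'for' loop
            for item in group:
                if keep(item):         # the 'if'
                    results.append(item)
        return [transform(r) for r in results]   # the second loop

    # Two different flows become two calls, not two copies of the flow.
    evens_doubled = apply_pattern([[1, 2], [3, 4]],
                                  lambda n: n % 2 == 0,
                                  lambda n: n * 2)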


In a lot of programs we see heavy use of this type of flow redundancy in both the persistence and GUI code. Lots of screens look different to the users, but really just display the same type or structure of data. Lots of the synchronization between the running code and the database is nearly identical. There are always ‘some’ differences, but if these can be captured and moved up and out of the mechanics, then lots of code disappears and lots of future development time is saved.
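
As a rough sketch, the per-screen differences can be pulled up into data, leaving one generic mechanism behind; the screen names and columns here are invented:

    SCREENS = {
        "customers": {"title": "Customers", "columns": ["name", "city"]},
        "invoices":  {"title": "Invoices",  "columns": ["number", "total"]},
    }

    def render_screen(screen_id, rows):
        spec = SCREENS[screen_id]
        print(spec["title"])
        for row in rows:
            print("  ".join(str(row[c]) for c in spec["columns"]))

    # One rendering mechanism; every new screen is just another
    # entry in SCREENS, not another pile of nearly identical code.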


We can see this in the data that is moving around as well. We can shift the name of a column in the database from being a compile-time variable over to being a runtime string. We can shift the type to being generic, then do late binding on the conversion if and only if it is needed somewhere else. With this, all of the tables in the database are just lists of maps in the running code. Yes, it is more expensive in terms of resources, but often that difference is negligible. This shift away from statically coding each and every domain-based variable to handling them all dynamically usually drops a massive amount of code. If it’s done consistently, then it also makes the remaining code super-flexible to changes, again paying huge dividends.
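
A minimal sketch of that shift, with rows as generic maps and the type conversion bound late; the columns and data here are made up:

    rows = [
        {"name": "widget", "price": "4.50"},   # every value starts generic
        {"name": "gadget", "price": "7.25"},
    ]

    def get(row, column, convert=str):
        return convert(row[column])   # convert only where a caller needs it

    total = sum(get(r, "price", float) for r in rows)   # late-bound to float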


But it’s not just data as it flows through the code. It’s also the representation of the ‘information’ collected as it is modeled and saved persistently. When this is normalized, the redundancies are removed, so there is a lot less data stored on disk. Less data is moved through the network and the code as well. Any derived values can be calculated later, as needed. Minimizing the footprint of the persisted data is a huge time saver, and it prevents a lot of redundant decomposition bugs. In some cases, it is also a significant performance improvement.
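
For example, as a sketch, a derived value like a total can be computed on demand rather than persisted; the fields here are hypothetical:

    order = {"quantity": 3, "unit_price": 4.50}   # persisted, normalized

    def order_total(o):
        return o["quantity"] * o["unit_price"]    # derived, never stored

    # Storing 'total' alongside quantity and unit_price would be a
    # redundancy waiting to fall out of sync.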


This also applies to configuration data. It’s really just the same as any domain data, but sometimes it’s stored in a file instead of in a database. It needs to be normalized too, and it should be decomposed as early and deeply as possible. 
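
As a sketch, a packed configuration value can be decomposed once, at load time, so nothing downstream has to re-parse it; the URL and key names here are invented:

    from urllib.parse import urlparse

    raw = {"db_url": "postgres://dbhost:5432/app"}   # one packed string

    parts = urlparse(raw["db_url"])
    config = {"scheme": parts.scheme, "host": parts.hostname,
              "port": parts.port, "name": parts.path.lstrip("/")}

    # Every consumer reads config["host"] and so on; nothing
    # re-parses the URL later in the code.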


Redundancies also show up in the weirdest of places. 


The code might use two different libraries or frameworks that are similar but not actually the same. That’s code duplication, even if the authors are not part of the project. Getting rid of one of them is a good choice. Any code or data ‘dependencies’ save time, but they are also problems waiting to happen. They only make sense if they save enough effort to justify their own expense, throughout the entire life of the project.


Redundancies can occur in documentation as well. You might have the same or similar content repeated all over the place, in a lot of similar documents. That generally ages really badly, setting the stage for hardships and misunderstandings.


Processes can be highly redundant as well. You might have to do 12 steps to set up and execute a release, and you do them redundantly, over and over again. That could be scripted into one step, ensuring that the same process happens reliably, every time. With one script it’s hard to get it wrong, and the script itself documents the steps that need to be taken.
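
A rough sketch of that in Python, with placeholder commands standing in for the real steps:

    import subprocess

    STEPS = [
        ["git", "tag", "v1.2.3"],          # placeholder commands; the real
        ["make", "build"],                 # steps depend on the project
        ["make", "deploy"],
    ]

    for step in STEPS:
        subprocess.run(step, check=True)   # stop immediately if a step fails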


The only redundancies that are actually helpful are the ones applied across time. For example, you might have saved all of the different binaries, for all of the different releases. It’s a type of insurance, just in case something bad happens: you have a ‘record’ of what was released and when. That can be pruned to save space, but a longer history is always better. This applies to the history in the repo and the documentation history as well. When something bad happens, you can figure out where it started to go wrong, and then do some work to prevent it from recurring, which usually saves a lot of support effort and helps convince the users that the code really is trustworthy.


In programming, there is never enough time. It’s the one consistent scarce resource that hasn’t changed over the decades. You can take a lot of foolish short-cuts to compensate, but it’s a whole lot better if you just stop doing redundant work all of the time. Once you’ve figured out how to do something, you should also figure out how to leverage that work so you never have to figure it out again. 
