Tuesday, July 14, 2020

Normalization Revisited

Over 10 years ago, I wrote 3 rather rambling posts on ideas about normalization:

  1. http://theprogrammersparadox.blogspot.com/2008/11/code-normal-form.html
  2. http://theprogrammersparadox.blogspot.com/2008/10/structure-of-elegance.html
  3. http://theprogrammersparadox.blogspot.com/2008/10/revisiting-structure-of-elegance.html

It’s never been a subject that grabbed a lot of attention, but it’s really surprising that it hasn’t.


Most code out there is ‘legacy code’. That is, it was written a while ago, often by someone who is gone now. Most projects aren’t disciplined enough to have consistent standards, and most medium-sized or larger systems consist of lots of redundant code, a gazillion libraries, endless spaghetti, broken configurations, no real documentation, etc.


Worse, a lot of code getting written right now depends on this pre-existing older stuff and is already somewhat tainted by it. That is, a crumbling code base always gets worse with time, not better.


These days we have good automated tooling for different languages, like gofmt (for Golang) or rubocop (for Ruby), that can enforce light standards by automatically reformatting the code; often these tools are set to run whenever a file is saved in the editor. This lets programmers be a bit sloppy in their coding habits, but auto-corrects the sloppiness before it sticks around for any length of time.
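
To make that concrete, here is a tiny made-up Go file and what gofmt turns it into (the -w flag really does rewrite files in place; gofmt actually indents with tabs, shown here as spaces):

    Before:

        package main

        import "fmt"

        func add(a ,b int)int{
        return a+b
        }

        func main() {
        fmt.Println(add( 1,2 ))
        }

    After running gofmt -w:

        package main

        import "fmt"

        func add(a, b int) int {
            return a + b
        }

        func main() {
            fmt.Println(add(1, 2))
        }

Nothing about the behavior changes; only the presentation is normalized, which is exactly the property we want to scale up from formatting to structure.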


Just putting some of these formatting tools into play in the editors is a big help in enforcing better consistency, getting better readability, and thus better overall quality.


What does that have to do with normalization? The idea is that, much like in a relational database, there is a small set of rules that constrain how things are structured and related. In a database, those rules are applied to the structure of the data; in code, they can be applied to the structure of the logic as well.
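
A loose analogy, sketched as Go types (these structs are invented purely for illustration): the denormalized version copies the same customer facts into every order, while the normalized version states each fact once and references it.

    package example

    // Denormalized: customer details are duplicated in every order.
    type OrderDenormalized struct {
        OrderID      int
        CustomerName string
        CustomerAddr string
        Amount       float64
    }

    // Normalized: each fact lives in exactly one place.
    type Customer struct {
        ID   int
        Name string
        Addr string
    }

    type Order struct {
        OrderID    int
        CustomerID int // a reference, not a copy
        Amount     float64
    }

The same instinct carries over to code: say each thing once, in one place, and have everything else reference it.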


The execution of code is a serialized list of instructions that each processor in the computer follows, but we often view it as a “tree” of function calls: each function is a node, and the functions it calls are its children. Mostly it looks like a tree, but since functions get reused and calls can loop back on themselves, it’s really a directed graph. A stack dump, then, is just one specific ‘path’ through this structure. If we dumped the stack a lot of times and combined the results, we’d get a more complete picture.
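
A rough sketch, in Go, of that “combine the stack dumps” idea (the function names and the sample stacks are invented; a real tool would pull them from profiler or runtime output):

    package main

    import "fmt"

    // callGraph maps each function to the set of functions it was seen calling.
    type callGraph map[string]map[string]bool

    // addStack merges one stack dump (outermost caller first) into the graph.
    func (g callGraph) addStack(frames []string) {
        for i := 0; i+1 < len(frames); i++ {
            caller, callee := frames[i], frames[i+1]
            if g[caller] == nil {
                g[caller] = map[string]bool{}
            }
            g[caller][callee] = true
        }
    }

    func main() {
        g := callGraph{}
        // Two hypothetical dumps taken at different moments in the run.
        g.addStack([]string{"main", "handleRequest", "loadUser", "queryDB"})
        g.addStack([]string{"main", "handleRequest", "renderPage"})

        for caller, callees := range g {
            for callee := range callees {
                fmt.Println(caller, "->", callee)
            }
        }
    }

Each dump is a single path; the union of many dumps is the call graph, and it is a graph rather than a tree precisely because the same function can show up under many different callers.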


We can take any of the instructions in this list of execution steps and move them ‘up’ or ‘down’ into other function calls in the graph. We can also break up or reassemble the functions themselves, the nodes, so that the overall structure gains properties like symmetry, consistent levels, and encapsulated components.
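
A tiny, invented Go example of that kind of move; the only claim being made is that the observable behavior is identical before and after:

    package example

    import "fmt"

    func writeToDisk(rec string) error { return nil } // stand-in for the real I/O

    // Before: the validation is buried inside the save step.
    func saveBefore(rec string) error {
        if rec == "" {
            return fmt.Errorf("empty record")
        }
        return writeToDisk(rec)
    }

    // After: the validation has been pulled 'up' into its own node, giving the
    // call graph a consistent validate -> write layering.
    func validate(rec string) error {
        if rec == "" {
            return fmt.Errorf("empty record")
        }
        return nil
    }

    func saveAfter(rec string) error {
        if err := validate(rec); err != nil {
            return err
        }
        return writeToDisk(rec)
    }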


It is probably very slow to take a lot of code and rework it into a ‘semantic’ structure, but once that is in place, the code can be shifted around according to the required normalizations, then returned to the source language. It might require a lot of disk space and a lot of CPU, but the time and resources it takes are basically irrelevant. Even if it needed a few weeks and 100 GB, it would still be incredibly useful.
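
For Go, at least, the ‘semantic structure’ part already exists in the standard library: go/parser turns source into an AST and go/format turns the AST back into canonical source. A minimal round trip, with the file names as placeholders, would look roughly like this:

    package main

    import (
        "go/format"
        "go/parser"
        "go/token"
        "os"
    )

    func main() {
        fset := token.NewFileSet()
        // "old.go" stands in for whatever legacy file is being processed.
        file, err := parser.ParseFile(fset, "old.go", nil, parser.ParseComments)
        if err != nil {
            panic(err)
        }

        // ... the normalizing transformations would rewrite the AST here ...

        out, err := os.Create("old.normalized.go")
        if err != nil {
            panic(err)
        }
        defer out.Close()
        if err := format.Node(out, fset, file); err != nil {
            panic(err)
        }
    }

Scaling this up to whole systems, and to languages without such friendly tooling, is where the weeks of processing would actually go.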


The idea is that if you take a huge pile of messy code and set up a reasonable number of ‘non-destructive’ normalizing refactors, the tool would just go off and grind through the work, cleaning up the code. It absolutely needs to be non-destructive so that the final result is trustworthy (or at least as trustworthy as the original code), and it needs to be fully batched, to avoid a lot of user interaction.
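
The batch side could be as simple as a fixed pipeline of rules, each one required to be a behavior-preserving rewrite of the parsed structure, applied in a fixed order over every file. The rule type below is an invented sketch, not a real library:

    package normalizer

    import "go/ast"

    // A rule is any behavior-preserving rewrite of a parsed source file.
    type rule func(*ast.File)

    // normalize runs every rule, in a fixed order, over every file.
    // No prompts, no interaction: it just grinds through the whole codebase.
    func normalize(files []*ast.File, rules []rule) {
        for _, f := range files {
            for _, r := range rules {
                r(f)
            }
        }
    }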


You could literally set it running, go off for a two-week vacation, and come back to a trustworthy, cleaned-up codebase that had exactly zero operational differences from the mess you had before you left. You could flip that into production right away, without even having to do regression testing.


A well-chosen set of rules could clean up variable naming, shift around the layers to unwind spaghetti, put a scope on globals, and even lay the groundwork for commenting. As well, it could identify redundant code, even if it doesn’t do the merge itself, and it could definitely identify useless data transformations, missing error handling, inconsistent calls, and a huge number of trivial and obvious problems. Basically, it’s a code review on steroids, but applied across the entire system, all at once.
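
Scoping a global, for example, is a mechanical, checkable rewrite. A before-and-after sketch in Go (invented code):

    package example

    // Before: an exported, mutable package-level global that any other
    // package can read or overwrite at any time.
    var MaxRetries = 3

    // After: the value is unexported and read through an accessor, so every
    // use is now a visible call site and the variable can no longer be
    // overwritten from outside the package.
    var maxRetries = 3

    // (In the real rewrite this accessor would simply take over the old name.)
    func MaxRetriesScoped() int { return maxRetries }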


It could create menial unit tests, redo the configurations, and shift behavior from one type of resource to another (config file -> database, or vice versa). Basically, it’s a huge amount of super useful cleanup work that programmers hate to do and procrastinate on until it causes grief.
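
The “menial unit tests” part is very tractable because the shape is always the same. A generated skeleton, in ordinary Go testing style, might look like this (the add function and the cases are placeholders a tool would fill in from the existing code):

    package example

    import "testing"

    func add(a, b int) int { return a + b }

    // A generated, table-driven pin-down test: it doesn't judge whether the
    // current behavior is right, it just locks it in before refactoring.
    func TestAdd(t *testing.T) {
        cases := []struct {
            name string
            a, b int
            want int
        }{
            {"zero", 0, 0, 0},
            {"positive", 2, 3, 5},
            {"negative", -1, 1, 0},
        }
        for _, c := range cases {
            if got := add(c.a, c.b); got != c.want {
                t.Errorf("%s: add(%d, %d) = %d, want %d", c.name, c.a, c.b, got, c.want)
            }
        }
    }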


To deal with interactivity, suggested changes could be put into comment blocks. So, it generates a better version of a method, comments it out, and leaves it in the code under the same name as the method it would replace. Much like dealing with SCM merges, you could read both versions, then pick the better one (and add some type of test to vet it properly).
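
In a Go codebase that might come out looking something like this (both versions are invented; the tool’s proposal sits right beside the original, commented out, under the same name):

    package example

    // Current version, left untouched.
    func joinNames(names []string) string {
        out := ""
        for i, n := range names {
            if i > 0 {
                out += ", "
            }
            out += n
        }
        return out
    }

    // PROPOSED REPLACEMENT (generated): same name, same behavior, simpler body;
    // accepting it would also add the "strings" import.
    // func joinNames(names []string) string {
    //     return strings.Join(names, ", ")
    // }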


Once the code is normalized, it takes considerably less cognitive effort to read through it and see the problems. Most of them will be obvious: starting something without ever closing it, doing weird translations on the data, not checking for errors, not doing any validation on incoming data. Most bugs fall into these categories. It could also help identify race conditions, locking issues, and general resource usage. For instance, if you normalize the code and find out that it has 4 different objects caching the same user data, it’s a pretty easy fix to correct that. Bloat is often a direct consequence of obvious redundancies.
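
The duplicate-cache case really is that mechanical once it’s visible. A stripped-down Go sketch of the before and after (the types are invented):

    package example

    type User struct {
        ID   int
        Name string
    }

    // Before: four separate structures, in four corners of the code,
    // all holding copies of the same user data.
    var sessionUsers = map[int]User{}
    var reportUsers = map[int]User{}
    var auditUsers = map[int]User{}
    var adminUsers = map[int]User{}

    // After: one cache, referenced from all four places.
    var userCache = map[int]User{}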


You could look at the resulting code right away, or just leave it to explore on an as-needed basis. Since it’s all consistent and follows the same standards, it will be way easier to extend. 


It would also be useful if it had an idiom-swapping mechanism, for languages where there are too many different ways to accomplish the same thing. If there are 4 ways of doing string fiddling, then swapping the other 3 down to 1 consistent version helps readability a lot.
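
String building in Go is a good example of this: the same result can come from concatenation, fmt.Sprintf, strings.Builder, or strings.Join. An idiom-swapping pass would pick one and rewrite the rest; a sketch of the variants (all behaviorally equivalent for this input):

    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        parts := []string{"a", "b", "c"}

        v1 := parts[0] + "," + parts[1] + "," + parts[2] // manual concatenation
        v2 := fmt.Sprintf("%s,%s,%s", parts[0], parts[1], parts[2])

        var b strings.Builder // incremental building
        for i, p := range parts {
            if i > 0 {
                b.WriteString(",")
            }
            b.WriteString(p)
        }
        v3 := b.String()

        v4 := strings.Join(parts, ",") // the one idiom everything gets swapped to

        fmt.Println(v1 == v4, v2 == v4, v3 == v4) // true true true
    }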


What kills a lot of development projects is that the code eventually becomes so messy that people are afraid to change it. Once that happens, either it goes crazy as an onion architecture (new stuff just wraps around the old stuff with lots of bugs) or it slows down into maintenance mode. Either way, the trajectory of the codebase is headed downwards, often quite rapidly. 


If all it took to get out of that death spiral was to spend a week fiddling with the configuration and then a week of extreme processing, that would make a huge difference. 


Code is encapsulated knowledge about the domain and/or the technical problems. That is a lot of knowledge, captured over time, of varying quality, and it would be a huge amount of work to go backward, throwing all of it away and starting over. Why do that? If you can take what was already there, already known, and leverage it, not as an unreadable opaque box but as a good codebase that is extendable, then life just got easier. People are somewhat irrational, programmers all have different styles, and as a big ‘stew’ of code it is overwhelming. If you can just normalize it all into something you want to work on, then most of those historic influences have been mitigated.
