A common problem in all code is juggling a lot of data.
If you are going to make your code base industrial-strength, one key part of that is moving any and all static data out of the code.
That includes any strings, particularly since they are the bane of internationalization.
That also includes flags, counters, etc.
A really strong piece of code has no constant declarations: not for logging, not for configuration, not even for user options. It still needs those constants, but they come from elsewhere.
This goal is really hard to achieve. But where it has been mostly achieved you usually see the code lasting a lot longer in usage.
What this gives you is a boatload of data for configurations, as well as the domain data, user data, and system data. Lots of data.
We know how to handle domain data. You put it into a reliable data store, sometimes an RDBMS, or maybe NoSQL technology. We know that preventing it from being redundant is crucial.
But we want the same thing for the other data types too.
It’s just that while we may want an explicit map of all of the possible configuration parameters, in most mediums, whether that is a file, the environment, or the CLI, we usually want to amend only a partial set. Maybe the defaults for 10,000 parameters are fine, but on a specific machine we need to change two or three. This changes how we see and deal with configuration data.
What we should do instead is take the data model for the system to be absolutely everything that is data. Treat all data the same, always.
Then we know that we can get it all from a bunch of different ‘sources’. All we have to do is establish a precedence and then merge on top.
For example, we have 100 weird network parameters that get shoved into calls. We put a default version of each parameter in a file. Then when we load that file, we go into the environment and see if there are any overriding changes, and then we look at the command line and see if there are any more overriding changes. We keep some type of hash of this mess, then as we need the values, we simply do a hash lookup to get our hands on the final value.
This means that a specific piece of code references the data only once, when it is needed, and that there are multiple loaders that write over each other in some order of precedence. With this it is easy.
Load from file, load on top from the env, load on top from the CLI. In some cases, we may want to load from a db too (there are fun production reasons why we might want this as the very last step).
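The loaders above can be sketched in a few lines. This is a minimal, hypothetical version: the JSON defaults file, the `APP_` environment prefix, and the `--key=value` CLI convention are all assumptions made for illustration.

```python
import json
import os
import sys


def load_config(defaults_path, argv=None):
    """Merge configuration by precedence: file defaults first,
    then environment overrides, then command-line overrides."""
    # 1. Defaults: the full, explicit map of known parameters.
    with open(defaults_path) as f:
        config = json.load(f)

    # 2. Environment: APP_TIMEOUT overrides the "timeout" default.
    for key in config:
        env_val = os.environ.get("APP_" + key.upper())
        if env_val is not None:
            config[key] = env_val

    # 3. Command line: --timeout=30 wins over everything earlier.
    for arg in (argv if argv is not None else sys.argv[1:]):
        if arg.startswith("--") and "=" in arg:
            key, _, value = arg[2:].partition("=")
            if key in config:
                config[key] = value

    return config
```

The code that consumes a parameter never cares which loader supplied it; it just does one lookup on the final merged map.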
We can validate the code because it uses a small number of parameters, all in the place they are needed. We can validate the default parameters. We can write code to scream if an env var or CLI param exists but has no default. And so forth. Well structured, easy to understand, easy to test, and fairly clean. All good attributes for code.
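The "scream if there is no default" check is a small function. A hypothetical sketch, again assuming an `APP_` prefix for environment overrides:

```python
def check_unknown_overrides(defaults, environ, prefix="APP_"):
    """Return the names of environment overrides that have no
    matching default -- likely typos or stale settings."""
    known = {prefix + key.upper() for key in defaults}
    return sorted(
        name for name in environ
        if name.startswith(prefix) and name not in known
    )
```

Run it at startup against `os.environ` and log (or abort on) anything it returns; a misspelled override then fails loudly instead of being silently ignored.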
The fun part though is that we can get crazier. As I said, this applies to all ‘data’ in the system, so we can span it out over all other data sources, and tweak it to handle instances as well as values. In that sense, you do something fun like use a cli command with args that fill in the raw domain data, so that you can do some pretty advanced testing. The possibilities are endless, but the code is still sane.
More useful, you can get some large-scale base domain data from one data source, then amend it with even more data from a bunch of other data sources. If you put checks on validity and drop garbage, the system could use a whole range of different places to merge the data by precedence. Start with the weakest sources first, and use very strong validation. Then loosen up the validation and pile other sources on top. You’ll lose a bit of the determinism overall, but offset that with robustness. You could do it live, or it could be a constant background refresh. That’s fun when you have huge fragmented data issues.
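One way to sketch that layered merge: each source carries its own validator, sources are applied weakest first, and anything that fails validation is dropped instead of polluting the result. The `(records, validator)` pairing is an assumption made for illustration.

```python
def merge_sources(sources):
    """Merge records from many sources in precedence order.

    `sources` is a list of (records, is_valid) pairs, weakest
    source first; records is a dict keyed by id. Invalid values
    are dropped, valid ones from later sources win.
    """
    merged = {}
    for records, is_valid in sources:
        for key, value in records.items():
            if is_valid(value):
                merged[key] = value  # later (stronger) sources win
    return merged
```

The validators are where the determinism trade-off lives: a strict check on the base source keeps the foundation clean, while looser checks let the later sources fill in gaps.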
Software is a static list of instructions, which we are constantly changing.
Friday, December 20, 2024
Thursday, December 12, 2024
In Full
One of the key goals of writing good software is to minimize the number of lines of code.
This is important because if you only needed 30K lines, but ended up writing 300K, it was a lot of work that is essentially wasted. It’s 10x the amount of code, 10x the amount of testing, and 10x the bugs. As well, you have to search through 10x lines to understand or find issues.
So, if you can get away with 30K instead of 300K, is it always worth doing that?
Well, almost always.
30K of really cryptic code that uses every syntactic trick in the book to cleverly compress the code down to almost nothing is generally unreadable. It is too small now.
You have to revisit the code all of the time: first to make sure it is reusable, but then later to make sure it is doing the right things in the right way. Reading code is an important part of keeping the entire software development project manageable.
So 30K of really compressed, unreadable stuff, in that sense is no better than 300K of really long and redundant stuff. It’s just a similar problem but in the opposite direction.
Thus, while we want to shrink down the representation of the code to its minimum, we oddly want to expand out the expression of that code to its maximum. It may seem like a contradiction, but it isn’t.
What the code does is not the same as the way you express it. You want it to do the minimum, but you usually want the expression to be maximized.
You could, for instance, hack 10 functions into a complex calling chain, something like A(B(C(D(E(F(G(H(I(K(args)))))))))) or A().B().C().D().E().F().G().H().I().K(args), so it fits on a single line.
That’s nuts, and it would severely impair the ability of anyone to figure out what either line of code does. So, instead, you put the A..K function calls on individual lines. Call them one by one. Ultimately it does the same thing, but it eats up 10x lines to get there. Which is not just okay, it is actually what you want to do.
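A concrete before-and-after, using placeholder names for a small hypothetical pipeline:

```python
# Hypothetical pipeline steps; the names are placeholders.
def parse(text):
    return text.split(",")

def clean(items):
    return [s.strip() for s in items]

def count(items):
    return len(items)

# Compressed: everything on one line, hard to read or debug.
def field_count_terse(text):
    return count(clean(parse(text)))

# Expanded: one call per line. Same behavior, but every
# intermediate value has a name and a place for a breakpoint.
def field_count(text):
    items = parse(text)
    items = clean(items)
    total = count(items)
    return total
```

Both versions do exactly the same work; the expanded one simply spends more lines buying readability and debuggability.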
It’s an unrealistic example since you really shouldn’t end up with 10 individual function calls. Normally some of them should be encapsulated below the others.
Encapsulation hides stuff, but if the hidden parts are really just sub-parts of the parent, then it is a very good form of hiding. If you understand what the parent needs to do, then you would assume that the children should be there.
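That kind of hiding looks like this. A hypothetical sketch: the two helpers are sub-parts of the parent, so they live beneath it rather than being chained at the top level.

```python
def normalize_whitespace(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())

def strip_punctuation(text):
    """Keep only letters, digits, and spaces."""
    return "".join(ch for ch in text if ch.isalnum() or ch == " ")

def clean_title(raw):
    """Parent step: the helpers above are sub-parts of
    'cleaning a title', so they are encapsulated under it."""
    text = normalize_whitespace(raw)
    text = strip_punctuation(text)
    return text.title()
```

A caller only sees `clean_title`; anyone who understands what the parent must do would expect exactly these children underneath it.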
Also, the function names should be self-describing, not just letters of the alphabet.
Still, you often see coding nearly as bad as the above, when it should have been written out fully.
Getting back to the point, if you had 60K of code that spelled out everything in full, instead of the 30K of cryptic code, or the 300K of redundant code, then you have probably reached the best code possible. Not too big, but also not too small. Lots of languages provide plenty of syntactic sugar, but you really only want to use those features to make the code more readable, not just smaller.
Friday, December 6, 2024
The Right Thing
Long ago, rather jokingly, someone gave me specific rules for working; the most interesting of them was “Do the Right Thing”.
In some cases with software development, the right thing is tricky. Is it the right thing for you? For the project? For using a specific technology?
In other cases though, it is fairly obvious.
For example, even if a given programming language lets you get crazy with spacing, you do not. You format the source code properly, each and every time. The formatting of any source file should be clean.
In one sense, it doesn’t bug the computer if the spaces are messed up. But source code isn’t just for a computer. You’ll probably have to reread it a lot, and if the code is worth using, aka well written, lots of other people will have to read it as well. We are not concerned with the time it takes to type in the code or edit it carefully to fix any spacing issues. We are concerned with the friction that bad spacing adds when humans are reading the code. The right thing to do is not add unnecessary friction.
That ‘do extra now, for a benefit later’ principle is quite common when building software. Yet we see it ignored far too often.
One place it applies strongly, but is not often discussed is with hacks.
The right thing to do if you encounter an ugly hack below you is not to ignore it. If you let it percolate upwards, the problem with it will continue. And continue.
Instead, if you find something like that, you want to encapsulate it in the lowest possible place to keep it from littering your code. Don’t embrace the ugliness, don’t embrace the mess. Wrap it and contain it. Make the wrapper act the way the underlying code should have been constructed.
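A minimal sketch of that wrapping, with hypothetical names throughout. Suppose a lower-level function that is not ours to change signals failure with a magic value instead of raising:

```python
def legacy_lookup(table, key):
    """The ugly hack below us: -1 means 'not found'."""
    return table.get(key, -1)

def lookup(table, key, default=None):
    """Wrapper that contains the hack: callers see a normal
    default-or-exception interface, never the magic value."""
    value = legacy_lookup(table, key)
    if value == -1:
        if default is not None:
            return default
        raise KeyError(key)
    return value
```

Everything built on top calls `lookup` and behaves as if the underlying code had been constructed properly; the magic value never percolates upward.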
The reason this is right is because of complexity. If everybody lets everybody else’s messes propagate everywhere, the artificial complexity in the final effort will be outrageous. You haven’t built code, you have just glued together a bunch of messes into a bigger mess. The value of what you wrote is minimal.
A lot of people like the ‘not my problem’ approach. Or the ‘do the laziest thing possible’ one. But the value you are building is essentially crippled with those approaches. You might get to the end quicker, but you didn’t really do what was asked. If they asked me to build a system, it is inherent that the system should be at least good enough. If what I deliver is a huge mess, then I failed on that basic implicit requirement. I did not do my job properly.
So the right thing is to fix the foundational messes by properly encapsulating them so that what is built on top is 1000x more workable. Encapsulate the mess, don’t percolate it.
If you go back and look at much older technology, you definitely see periods where doing the right thing was far more common. And not only does it show, but we also depend on that type of code far more than we depend on the flakey stuff. That’s why the length of time your code is in production is often a reasonable proxy to its quality, which is usually a manifestation of the ability of some programmers to find really clean ways of implementing complicated problems. It is all related.