Friday, January 17, 2025

Complexity

Often, when people encounter intense complexity in their path, they decide that an oversimplification is a better choice than the truth.

In software terms, they are asked to provide a solution for a really hard, complex problem. The type of code that would choke most programmers. Instead of tackling it, they provide something that sort of, kind of, works; something a little closer to what was needed, but never really suitable.

That’s a misfitting solution. It spawns all sorts of other problems as an explosion of unaddressed fragments. So they go into firefighting mode, trying to put out all of these secondary fires, but it only gets worse.

The mess and the attempt to bring it back under control can take far more time and effort than if they had just tackled the real problem. These types of “shortcuts” usually take far longer. They become a black hole sucking in all sorts of other stuff, spinning wildly out of control. Sometimes they never, ever, really work properly. The world is littered with such systems. Half-baked road hazards, just getting in people’s way.

Now it may be that the real solution would cross some sort of impenetrable boundary and be truly impossible, but more often it just takes a long time, a lot of concentration, and an abstraction or two to anchor it. If you carefully look at it from the right angle, it is very tractable. You just have to spend the time looking for that viewpoint.

But instead of stepping back to think, people dive in. They just apply brute force, in obvious ways, trying to pound it all into working.

If it’s a lot of complexity and you try to outrun it, you’ll end up with so much poor code that you’ll never really get any of it to work properly. If, instead, you try to ignore it, it will return to haunt you in all sorts of other ways.

You need to understand it, then step up a level or two to find some higher ground that covers and encapsulates it. That code is abstract but workable.

It will be tough to implement, but as you see the bugs manifest, you rework the abstraction to correct the behavior. You’ll get far fewer bugs, but they will be far harder to solve. You can’t just toss on band-aids; they require deep refactoring each time. Still, once solved they won’t return or cascade, which ultimately makes it all a whole lot easier. It’s slower in the beginning but pays off.
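
As a rough Python sketch of the difference (everything in it is hypothetical), compare pounding each case in by hand with stepping up to one small abstraction that covers them all:

    # Brute force: every case pounded in as its own branch; each new case
    # means another code path to write, test, and debug.
    def shipping_brute(region, weight):
        if region == "domestic" and weight <= 1:
            return 5.0
        if region == "domestic" and weight <= 10:
            return 12.0
        if region == "international" and weight <= 1:
            return 15.0
        if region == "international" and weight <= 10:
            return 40.0
        raise ValueError("unsupported shipment")

    # Higher ground: the rule table is the abstraction. The complexity is
    # encapsulated in one place, and new cases become data instead of code.
    RATES = [
        ("domestic", 1, 5.0),
        ("domestic", 10, 12.0),
        ("international", 1, 15.0),
        ("international", 10, 40.0),
    ]

    def shipping(region, weight):
        for rule_region, max_weight, cost in RATES:
            if region == rule_region and weight <= max_weight:
                return cost
        raise ValueError("unsupported shipment")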

The complexity of a solution has to match the complexity of the problem it is trying to solve. There is no easy way around this; you can’t just cheat and hope it works out. It won’t. It never has.

Saturday, January 4, 2025

Data Collection

There are lots of technologies available that will help companies avoid spending time organizing their data. They let them just dump it all together, then pick it apart later.

Mostly, that hasn’t worked very well. Either the mess leaves the data lost in a swamp, or the resource usage is far too extreme to offset the costs.

But it also isn’t necessary.

The data that a company acquires is data that it specifically intends to collect. It’s about their products and services, customers, internal processes, etc. It isn’t random data that randomly appears.

Mining that data for knowledge might, at very low probabilities, offer some surprises, but likely the front line of the business already knows these, even if that knowledge isn’t getting communicated well.

Companies also know the ‘structure’ of the data in the wild. It might change periodically, or be ignored for a while, but direct observations can usually describe it accurately. Strong analysis saves time.

Companies collect data in order to run at larger scales. So, with a few exceptions, sifting through that data is not exploratory. It’s an attempt to get a reliable snapshot of the world at many different moments.

There are exploratory tasks for some industries too, but these are relatively small in scope, and they are generally about searching for unexpected patterns. But this means that first, you need to know the set of expected patterns. That step is often skipped.
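
A minimal Python sketch of that first step, with made-up field names: write the expected patterns down explicitly, so that anything falling outside them is what is actually worth investigating:

    import re

    # The set of expected patterns, stated up front (hypothetical fields).
    EXPECTED = {
        "customer_id": re.compile(r"C\d{8}"),
        "region": re.compile(r"north|south|east|west"),
    }

    def unexpected_fields(record):
        # Return the fields that do not match any known pattern.
        return [name for name, pattern in EXPECTED.items()
                if not pattern.fullmatch(str(record.get(name, "")))]

    # Only what gets flagged here is a candidate for a genuine surprise.
    print(unexpected_fields({"customer_id": "C00012345", "region": "northwest"}))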

Mostly, data isn’t exotic, it isn’t random, and it shouldn’t be a surprise. If there are a dozen different representations for it when it is collected, that is a mistake. Too often we get obsessed with technology but forget about its purpose. That is always an expensive mistake.

Friday, December 20, 2024

Datasource Precedents

A common problem in all code is juggling a lot of data.

If you are going to make your code base industrial-strength, one key part of that is to move any and all static data outside of the code.

That includes any strings, particularly since they are the bane of internationalization.

That also includes flags, counters, etc.

A really strong piece of code has no constant declarations. Not for the logging, nor the configuration, nor even for the user options. It still needs those constants, but they come from elsewhere.

This goal is really hard to achieve. But where it has been mostly achieved, you usually see the code lasting a lot longer in usage.

What this gives you is a boatload of data for configurations, as well as the domain data, user data, and system data. Lots of data.

We know how to handle domain data. You put it into a reliable data store, sometimes an RDBMS, or maybe a NoSQL technology. We know that preventing it from being redundant is crucial.

But we want the same thing for the other data types too.

It’s just that while we may want an explicit map of all of the possible configuration parameters, in most mediums, whether that is a file, the environment, or the cli, we usually want to amend a partial set. Maybe the default for 10,000 parameters is fine, but on a specific machine we need to change two or three. This changes how we see and deal with configuration data.

What we should do instead is take the data model for the system to be absolutely everything that is data. Treat all data the same, always.

Then we know that we can get it all from a bunch of different ‘sources’. All we have to do is establish a precedence and then merge on top.

For example, we have 100 weird network parameters that get shoved into calls. We put a default version of each parameter in a file. Then when we load that file, we go into the environment and see if there are any overriding changes, and then we look at the command line and see if there are any more overriding changes. We keep some type of hash for this mess, then as we need the values, we simply do a hash lookup to get our hands on the final value.

This means that a specific piece of code only references the data once, when it is needed. There are multiple loaders that write over each other, in some form of precedence. With this, it is easy.

Load from file, load on top from env, load on top from cli. In some cases, we may want to load from a db too (there are fun production reasons why we might want this as the very last step).
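
A minimal sketch of that layering in Python, assuming a JSON defaults file, an APP_ prefix for environment variables, and --name=value cli arguments; all of those conventions are made up for the example:

    import json
    import os
    import sys

    def load_defaults(path):
        # Weakest layer: the explicit map of every parameter and its default.
        with open(path) as f:
            return json.load(f)

    def overlay_env(params):
        # Next layer: a matching environment variable overrides the default.
        for name in params:
            value = os.environ.get("APP_" + name.upper())
            if value is not None:
                params[name] = value
        return params

    def overlay_cli(params, argv):
        # Strongest layer: --name=value arguments override everything below.
        for arg in argv:
            if arg.startswith("--") and "=" in arg:
                name, _, value = arg[2:].partition("=")
                if name in params:
                    params[name] = value
        return params

    # Load from file, load on top from env, load on top from cli.
    params = overlay_cli(overlay_env(load_defaults("defaults.json")), sys.argv[1:])

    # Elsewhere, a specific piece of code does a single lookup for the final value.
    timeout = params["connect_timeout"]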

We can validate the code because it uses a small number of parameters, all in the place they are needed. We can validate the default parameters. We can write code to scream if an env var or cli param exists but there are no defaults. And so forth. Well-structured, easy to understand, easy to test, and fairly clean. All good attributes for code.
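
The ‘scream’ check falls out almost for free once the default map is the single authority; a sketch that continues the hypothetical loaders above:

    def reject_unknown(params, argv):
        # Complain loudly about any override that has no default to land on.
        known = {"APP_" + name.upper() for name in params}
        for var in os.environ:
            if var.startswith("APP_") and var not in known:
                raise ValueError(f"unknown environment override: {var}")
        for arg in argv:
            if arg.startswith("--") and arg[2:].partition("=")[0] not in params:
                raise ValueError(f"unknown cli parameter: {arg}")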

The fun part, though, is that we can get crazier. As I said, this applies to all ‘data’ in the system, so we can span it out over all other data sources, and tweak it to handle instances as well as values. In that sense, you can do something fun like using a cli command with args that fill in the raw domain data, so that you can do some pretty advanced testing. The possibilities are endless, but the code is still sane.

More usefully, you can get some large-scale base domain data from one data source, then amend it with even more data from a bunch of other data sources. If you put checks on validity and drop the garbage, the system could use a whole range of different places to merge the data by precedence. Start with the weakest sources first, and use very strong validation. Then loosen up the validation and pile other sources on top. You’ll lose a bit of the determinism overall, but offset that with robustness. You could do it live, or it could be a constant background refresh. That’s fun when you have huge fragmented data issues.
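
A sketch of that shape, again with everything hypothetical: each source carries its own validator, they are applied from weakest to strongest, and the garbage gets dropped along the way:

    def merge_by_precedence(sources):
        # sources: (records, is_valid) pairs, ordered from weakest to strongest.
        merged = {}
        for records, is_valid in sources:
            for key, record in records.items():
                if is_valid(record):
                    merged[key] = record  # a stronger source overwrites a weaker one
                # records that fail validation never make it in
        return merged

    def strict(record):
        return {"id", "name", "address"} <= record.keys()

    def loose(record):
        return "id" in record

    base = {1: {"id": 1, "name": "a", "address": "x"}}   # weakest, strongly validated
    extra = {1: {"id": 1, "name": "b"}, 2: {"id": 2}}    # stronger, loosely validated
    print(merge_by_precedence([(base, strict), (extra, loose)]))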

Thursday, December 12, 2024

In Full

One of the key goals of writing good software is to minimize the number of lines of code.

This is important because if you only needed 30K lines, but ended up writing 300K, a lot of that work is essentially wasted. It’s 10x the amount of code, 10x the amount of testing, and 10x the bugs. As well, you have to search through 10x the lines to understand or find issues.

So, if you can get away with 30K instead of 300K, is it always worth doing that?

Well, almost always.

30K of really cryptic code that uses every syntactic trick in the book to cleverly compress the code down to almost nothing is generally unreadable. It is too small now.

You have to revisit the code all of the time.

First to make sure it is reusable, but then later to make sure it is doing the right things in the right way. Reading code is an important part of keeping the entire software development project manageable.

So 30K of really compressed, unreadable stuff is, in that sense, no better than 300K of really long and redundant stuff. It’s just a similar problem but in the opposite direction.

Thus, while we want to shrink down the representation of the code to its minimum, we oddly want to expand out the expression of that code to its maximum. It may seem like a contradiction, but it isn’t.

What the code does is not the same as the way you express it. You want it to do the minimum, but you usually want the expression to be maximized.

You could, for instance, hack 10 functions into a complex calling chain, something like A(B(C(D(E(F(G(H(I(K(args)))))))))) or A().B().C().D().E().F().G().H().I().K(args), so it fits on a single line.

That’s nuts, and it would severely impair the ability of anyone to figure out what either line of code does. So, instead, you put the A..K function calls on individual lines. Call them one by one. Ultimately it does the same thing, but it eats up 10x the lines to get there. Which is not just okay, it is actually what you want to do.
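
Sketching it with the same placeholder functions from above, the expansion is nothing fancy:

    # Compressed: it works, but nobody can follow it.
    result = A(B(C(D(E(F(G(H(I(K(args))))))))))

    # Expanded: one call per line, each intermediate value visible and easy
    # to inspect in a debugger or dump to a log.
    value = K(args)
    value = I(value)
    value = H(value)
    value = G(value)
    value = F(value)
    value = E(value)
    value = D(value)
    value = C(value)
    value = B(value)
    result = A(value)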

It’s an unrealistic example since you really shouldn’t end up with 10 individual function calls. Normally some of them should be encapsulated below the others. 

Encapsulation hides stuff, but if the hidden parts are really just sub-parts of the parent, then it is a very good form of hiding. If you understand what the parent needs to do, then you would assume that the children should be there. 

Also, the function names should be self-describing, not just letters of the alphabet.

Still, you often see coding nearly as bad as the above, when it should have been written out fully.

Getting back to the point, if you had 60K of code that spelled out everything in full, instead of the 30K of cryptic code or the 300K of redundant code, then you have probably reached the best code possible. Not too big, but also not too small. Lots of languages provide plenty of syntactic sugar, but you really only want to use those features to make the code more readable, not just smaller.

Friday, December 6, 2024

The Right Thing

Long ago, rather jokingly, someone gave me specific rules for working; the most interesting of them was “Do the Right Thing”.

In some cases with software development, the right thing is tricky. Is it the right thing for you? For the project? For using a specific technology?

In other cases though, it is fairly obvious.

For example, even if a given programming language lets you get crazy with spacing, you do not. You format the source code properly, each and every time. The formatting of any source file should be clean.

In one sense, it doesn’t bug the computer if the spaces are messed up. But source code isn’t just for a computer. You’ll probably have to reread it a lot, and if the code is worth using, aka well written, lots of other people will have to read it as well. We are not concerned with the time it takes to type in the code or edit it carefully to fix any spacing issues. We are concerned with the friction that bad spacing adds when humans are reading the code. The right thing to do is not add unnecessary friction.

That ‘do extra now, for a benefit later’ principle is quite common when building software. Yet we see it ignored far too often.

One place it applies strongly, but is not often discussed, is with hacks.

The right thing to do if you encounter an ugly hack below you is not to ignore it. If you let it percolate upwards, the problem with it will continue. And continue.

Instead, if you find something like that, you want to encapsulate it in the lowest possible place to keep it from littering your code. Don’t embrace the ugliness, don’t embrace the mess. Wrap it and contain it. Make the wrapper act the way the underlying code should have been constructed.
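
As a hypothetical Python sketch, suppose a legacy call below you signals failure with magic “ERR:” strings instead of raising errors; wrap it once, at the lowest level, and nothing above ever sees the mess:

    class RecordNotFound(Exception):
        pass

    def legacy_fetch(key):
        # Stand-in for the ugly hack we cannot change: magic strings on failure.
        return {"user:1": {"name": "a"}}.get(key, "ERR:404")

    def fetch(key):
        # The wrapper acts the way the underlying code should have behaved.
        result = legacy_fetch(key)
        if isinstance(result, str) and result.startswith("ERR:"):
            raise RecordNotFound(f"no record for {key!r} ({result})")
        return result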

The reason this is right is because of complexity. If everybody lets everybody else’s messes propagate everywhere, the artificial complexity in the final effort will be outrageous. You haven’t built code; you’ve just glued together a bunch of messes into a bigger mess. The value of what you wrote is minimal.

A lot of people like the ‘not my problem’ approach. Or the ‘do the laziest thing possible’ one. But the value you are building is essentially crippled with those approaches. You might get to the end quicker, but you didn’t really do what was asked. If they asked me to build a system, it is inherent that the system should be at least good enough. If what I deliver is a huge mess, then I failed on that basic implicit requirement. I did not do my job properly.

So the right thing is to fix the foundational messes by properly encapsulating them so that what is built on top is 1000x more workable. Encapsulate the mess, don’t percolate it.

If you go back and look at much older technology, you definitely see periods where doing the right thing was far more common. And not only does it show, but we also depend on that type of code far more than we depend on the flaky stuff. That’s why the length of time your code stays in production is often a reasonable proxy for its quality, which is usually a manifestation of the ability of some programmers to find really clean ways of implementing complicated problems. It is all related.