Thursday, February 13, 2025

Control

I’ve often written about the importance of reusing code, but I fear that the industry’s notion of reuse has drifted far away from what I mean.

As far as time goes, the worst thing you can do as a programmer is write very similar code, over and over and over again. We’ve always referred to that as ‘brute force’. You sit at the keyboard and pound out very specific code with slight modifications. It’s a waste of time.

We don’t want to do that because it is an extreme work multiplier. If you have a bunch of similar problems, it saves orders of magnitude of time to just write it once a little generally, then leverage it for everything else.
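
A minimal sketch of the difference, with invented field names:

    # Brute force: nearly identical code, retyped for every variation.
    def total_sales(rows):
        total = 0
        for row in rows:
            total += row["sales"]
        return total

    def total_refunds(rows):
        total = 0
        for row in rows:
            total += row["refunds"]
        return total

    # Written once, a little generally, then leveraged for everything else.
    def total(rows, field):
        return sum(row[field] for row in rows)

    # total(rows, "sales"), total(rows, "refunds"), total(rows, "tax"), ...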

But somehow the modern version of that notion is that instead of writing any significant code, you just pile on as many libraries, frameworks, and products as you can. The idea is that you don’t write stuff, you just glue it together for construction speed. The corollary is that stuff written by other people is better than the stuff you’ll write.

The flaw in that approach is ‘control’. If you don’t control the code, then when there is a problem with that code, your life will become a nightmare. Your ‘dependencies’ may be buggy. Those bugs will always trigger at the moment you don’t have time to deal with them. With no control, there is little you can do about some low-level bug except find a bad patch for it. If you get enough bad patches, the whole thing is unstable, and will eventually collapse.

You get caught in a bad cycle of wasting all of your time on things you can’t do anything about, so you don’t have the time anymore to break out of the cycle. It just sucks you down and down and down.

The other problem is that the dependencies may go rogue. You picked them for a subset of what they do, but their developers might really want to do something else. They drift away from you, so your glue gets uglier and uglier. Once that starts, it never gets better.

In software, the ‘things’ you don’t control will always come back to haunt you. Which is why we want to control as much as possible.

So, reusing your own stuff is great, but reusing other people’s stuff has severe inherent risks.

The best way to deal with this is to write your own version of whatever you can, given the time available. That is, throwing in a trivial library just because it exists is bad. You can look at how they implemented it, and then do your own version which is better and fits properly into your codebase. In that sense, it's nice that these libraries exist, but it is far safer to use them as examples for learning than to wire them up into your code.

There are some underlying components, however, that are super hard to get correct. Pretty much anything that deals with persistence falls into this category, as it requires a great deal of knowledge about transactional integrity to make the mechanics fault-tolerant. If you do it wrong, you get random bugs popping up all over the place. You can’t fix a super rare bug simply because you cannot replicate it, so you’d never have any certainty that your code changes did what you needed them to do. Where there is one heisenbug, there are usually lots more lurking about.
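
To give a sense of how fiddly this gets, here is a minimal sketch of one tiny corner of the problem: writing a file so that a crash cannot leave it half-written. It assumes a filesystem where os.replace is atomic, and it still ignores plenty of real failure modes (directory syncing, media errors, concurrent writers):

    import os
    import tempfile

    def atomic_write(path, data: bytes):
        # Write to a temporary file in the same directory, force it to disk,
        # then atomically swap it into place. Readers see the old contents or
        # the new contents, never a partial write.
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())  # push past the OS cache, not just the buffer
            os.replace(tmp, path)     # atomic rename on the same filesystem
        except BaseException:
            try:
                os.unlink(tmp)
            except OSError:
                pass
            raise

Even this toy version has sharp edges, which is exactly why mature persistence code is worth borrowing rather than rewriting.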

You could learn all about low-level systems programming, fault tolerance, and such, but you probably don’t have the decade available to do that right now, so you really do want to use someone else’s code for this. You want to leverage their deep knowledge and get something nearly state-of-the-art.

But that is where things get complicated again. People seem to think that ‘newer’ is always better. Coding seems to come in waves, so sometimes the newer technologies are real actual improvements on the older stuff. The authors understood the state of the art and improved upon it. But only sometimes.

Sometimes the authors ignore what is out there, have no idea what the state of the art really is, and just go all the way back to first principles to make every old mistake again. And again. There might be some slight terminology differences that seem more modern, but the underlying work is crude and will take decades to mature if it does. You really don't want to be building on anything like that. It is unstable and everything you put on top will be unstable too. Bad technology never gets better.

So, sometimes you need to add stuff you can’t control, and that is inherently hazardous.

If you pick something trendy that is also flaky, you’ll just suffer a lot of unnecessary problems. You need to pick the last good thing, not the most recent one.

That is always a tough choice, but crucial to building stable stuff. As a consequence though, it is important to recognize that sometimes the choice was bad: you picked a dud. Admit it early, since it is usually cheaper to swap it for something else as early as possible.

Bad dependencies are time sinks. If you don’t control it and can’t fix it when it breaks, then at the very least you need it to be trustworthy. Which means it is reliable and relatively straightforward to use. You never need a lot of features, and in most cases, you shouldn’t need a lot of configuration either. Just stuff that does exactly what it is supposed to do, all of the time. You want it to encapsulate all of the ugliness away from you, but you also want it to deal with that ugliness correctly, not just ignore it.
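
One practical way to keep some of that control is to hide each dependency behind a small interface that you own, so the rest of the codebase never touches it directly. A minimal sketch, using sqlite3 from the standard library as a stand-in for whatever you actually depend on; the class and table names are invented for illustration:

    # storage.py -- the only module allowed to import the dependency directly.
    import sqlite3

    class KeyValueStore:
        """The narrow interface the rest of the codebase is written against."""

        def __init__(self, path: str):
            self._db = sqlite3.connect(path)
            self._db.execute(
                "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")

        def put(self, key: str, value: str) -> None:
            with self._db:  # one transaction per write
                self._db.execute(
                    "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
                    (key, value))

        def get(self, key: str):
            row = self._db.execute(
                "SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
            return None if row is None else row[0]

    # If the dependency disappoints, only this file changes; the callers of
    # KeyValueStore never need to know it was swapped out.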

If you are picking great stuff to build on, then you get more time to spend building your own stuff, and if you aren’t just retyping similar code over and over again, you can spend this time keeping your work organized and digging deeply into the problems you face. You are in control. That makes coding a whole lot more enjoyable than just rushing through, splatting out endless frail code. After all, programming is about problem-solving, and we want to keep solving unique, high-quality problems, not redundant, trivial, and annoying ones. Your codebase should build on your knowledge and understanding. That is how you master the art.

Tuesday, February 4, 2025

Integrated Documentation

Long ago, we built some very complex software.

We had a separate markdown wiki to contain all of the necessary documentation.

Over time, the main repo survived, but that wiki didn’t. All of the documentation was disconnected and thus was lost.

When I returned to the project years later, it was still in active use and people needed it, but the missing documentation was causing chaos. They had shot themselves in the foot.

Since then, I have put the documentation inside the repo with the rest of the source code. Keeping track of one thing in a large organization is difficult enough; trying to keep two different things in sync is impossible.

By now, we should be moving closer to literate programming: https://en.wikipedia.org/wiki/Literate_programming

Code without documentation is just a ball of mud. Code with documentation is a solution that hopefully solves somebody’s problems. Any nontrivial lump of code is complicated enough that it needs extra information to make it usable.

For repo hosting sites like GitHub and GitLab, if they offer some type of wiki for documentation, that wiki should be placed in the main project repo as a subdirectory. The markdown files are effectively source files. Any included files are effectively source files. They need to get versioned like everything else.
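
A hypothetical layout, just to illustrate the idea; the directory names are made up, but the point is that the documentation branches, merges, and gets reviewed along with the code:

    project/
        src/            <- the code
        docs/           <- the wiki, as plain markdown files
            overview.md
            operations.md
            images/     <- any included diagrams
        README.md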

There has always been this misunderstanding that source files must be ‘text’ and that, for the most part, they are always ‘code’. That is incorrect. The files are the ‘source’ of the data, information, binaries, etc. It was common practice to put binary library files into the source repo, for example, when they had been delivered to the project from outside sources. Keeping everything together with a long and valid history is important. The only thing that should not be in a repo is secrets, as they should remain secret.

Otherwise, the repo should contain everything. If something has to be pulled from other sites, it should be pinned to very explicit versions; it should not be left to chance. If you go back to a historical version, it should be an accurate historical version, not a random mix of history.

A fully documented self-standing repo is a thing of beauty. A half-baked repo is not. We keep history to reduce friction and make our lives easier. It is a little bit of work, but worth it.

Friday, January 24, 2025

Self Describing

One of the greatest problems with software is that it can easily be disconnected.

There is a lot of code behind some projects or functionality, but people can’t figure out what it is trying to do. It’s just a big jumble of nearly random instructions.

The original programmers may have mostly understood what it does and how it works, but they may not have been able to communicate that information to everyone who may be interested in leveraging their work.

A big problem is cryptic naming.

The programmers pick acronyms or short versions for their names instead of spelling out the full words. Some acronyms are well-known or mostly obvious, but most are eclectic and vague. They mean something to the programmers, but not to anyone else. A name that only a few people understand is not a good name.

The notion that spelling everything out to be readable is a waste of time is an unfortunate myth of our industry. Even if it saves you a few minutes of typing, it is likely to eat hours or days of somebody else’s time.
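
A tiny hypothetical example of the difference; the calculation is made up, but the point is what the next reader sees:

    # Cryptic: meaningful to the original author, and to no one else.
    def calc_tp(cq, up, dr):
        return cq * up * (1 - dr)

    # Spelled out: nobody has to guess.
    def calculate_total_price(quantity, unit_price, discount_rate):
        return quantity * unit_price * (1 - discount_rate)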

Another problem is the misuse of terminology.

There may have been a long-established meaning for some things, but the programmers weren’t fully aware of those definitions. Instead, they use the same words, but with a slight or significant change in the meaning. Basically, they are using the words wrong. Anyone with a history will be confused or annoyed by the inappropriate usage. That would lead other people astray.

Some programming cultures went the other way.

They end up spelling everything out in full, excessive detail, and it is the sheer length of the names that tends to make them misunderstood. They throw up a wall of stuff that obscures the parts underneath. We don’t need huge, extensive essays on how the code works, but we do need some extra information besides the code itself. Finding that balance is part of mastering programming.

Stuttering is a common symptom of severe naming problems. You’ll see parent/child relationships that have the exact same names. You never need to repeat the same string twice, but it has become rather too common to see that in code or file systems. For some technologies, it is too easy to stutter, but it's always a red flag that indicates that people didn’t take the time to avoid it. It makes you wonder what other shortcuts they took as well.
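
A small sketch of the problem, with invented names:

    # Stuttering: the parent name is repeated in every child field.
    class CustomerRecord:
        def __init__(self, customer_name, customer_address):
            self.customer_name = customer_name
            self.customer_address = customer_address

    # order.customer.customer_name  <- the reader trips over the repetition

    # Cleaner: the context already says 'customer', so the fields don't repeat it.
    class Customer:
        def __init__(self, name, address):
            self.name = name
            self.address = address

    # order.customer.name  <- says the same thing, once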

Ultimately a self-describing name is one that gives all of the necessary information that a qualified person needs to get an understanding or to utilize something. There is always a target audience, but it is usually far larger than most programmers are willing to admit.

If you put your code in front of another programmer and they don’t get it, or they make badly mistaken assumptions about what it does, it is likely a naming problem. You can’t get help from others if they don’t understand what you are trying to do, and due to its complexity, serious programming has evolved into needing teams of people to work on it rather than just individuals.

Modern-day programming is slogging through a rather ugly mess of weird syntax, inconsistencies, awkwardness, confusion, and bugs galore. People used to take the time to make sure their work was clean and consistent, but now most of it is just ugly and half-baked, an annoyance to use. Wherever possible, we should try to avoid creating or using bad technologies; they do not make the world a better place.

Friday, January 17, 2025

Complexity

Often, when people encounter intense complexity in their path, they choose to believe that an oversimplification, rather than the truth, is a better choice.

In software terms, they are asked to provide a solution for a really hard, complex problem. The type of code that would choke most programmers. Instead, they avoid it and provide something that sort of, kind of works, something close-ish to what was needed but never really suitable.

That’s a misfitting solution. It spawns all sorts of other problems as an explosion of unaddressed fragments. So they go into firefighting mode, trying to put out all of these secondary fires, but it only gets worse.

The mess and the attempt to bring it back under control can take far more time and effort than if they had just tackled the real problem. These types of “shortcuts” are usually far longer. They become a black hole sucking in all sorts of other stuff, spinning wildly out of control. Sometimes they never, ever, really work properly. The world is littered with such systems. Half-baked road hazards, just getting in people’s way.

Now it may be that the real solution would cross some sort of impenetrable boundary and be truly impossible, but more often it just takes a long time, a lot of concentration, and needs an abstraction or two to anchor it. If you carefully look at it from the right angle it is very tractable. You just have to spend the time looking for that viewpoint.

But instead of stepping back to think, people dive in. They just apply brute force, in obvious ways, trying to pound it all into working.

If it's a lot of complexity and you try to outrun it, you’ll end up with so much poor code that you’ll never really get any of it to work properly. If instead, you try to ignore it, it will return to haunt you in all sorts of other ways.

You need to understand it, then step up a level or two to find some higher ground that covers and encapsulates it. That code is abstract but workable.

It will be tough to implement, but as you see the bugs manifest, you rework the abstraction to correct the behavior. You’ll get far fewer bugs, but they will be far harder to solve. You can’t just toss on bandaids, they require deep refactoring each time. Still, once solved they won’t return or cascade, which ultimately makes it all a whole lot easier. It’s slower in the beginning but pays off.
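
As one small, hedged illustration of what stepping up a level can look like (the account types and fee rules are entirely invented): instead of a hand-written branch for every case, the cases become data and the logic exists exactly once.

    # Brute force: every new account type means another hand-written branch.
    def monthly_fee_v1(account_type, balance):
        if account_type == "basic":
            return 5.0 if balance < 1000 else 0.0
        elif account_type == "premium":
            return 10.0 if balance < 5000 else 0.0
        elif account_type == "student":
            return 0.0
        else:
            raise ValueError(account_type)

    # Higher ground: the variation is captured as data, the code covers all of it.
    FEE_RULES = {
        "basic":   {"fee": 5.0,  "waived_above": 1000},
        "premium": {"fee": 10.0, "waived_above": 5000},
        "student": {"fee": 0.0,  "waived_above": 0},
    }

    def monthly_fee(account_type, balance):
        rule = FEE_RULES[account_type]
        return 0.0 if balance >= rule["waived_above"] else rule["fee"]

Adding the next dozen account types is now a data change, and any fix to the logic fixes every case at once.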

The complexity of a solution has to match the complexity of the problem it is trying to solve. There is no easy way around this, you can’t just cheat and hope it works out. It won’t. It never has.

Saturday, January 4, 2025

Data Collection

There are lots of technologies available that will help companies avoid spending time organizing their data. They let them just dump it all together, then pick it apart later.

Mostly, that hasn’t worked very well. Either the data ends up lost in the swamp, or the resource usage is so extreme that the benefits never offset the costs.

But it also isn’t necessary.

The data that a company acquires is data that it specifically intends to collect. It’s about their products and services, customers, internal processes, etc. It isn’t random data that randomly appears.

Mining that data for knowledge might, at very low probabilities, offer some surprises, but the front line of the business likely already knows about them, even if that knowledge isn’t being communicated well.

Companies also know the ‘structure’ of the data in the wild. It might change periodically, or be ignored for a while, but direct observations can usually describe it accurately. Strong analysis saves time.
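
As a minimal sketch of what knowing that structure can look like in practice (the fields here are invented), the expected shape is written down once and checked as the data arrives, rather than being dumped raw and reverse-engineered years later:

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class SaleRecord:
        # The structure the business already knows: what was sold, to whom, when.
        product_id: str
        customer_id: str
        quantity: int
        unit_price: float
        sold_on: date

    def parse_sale(raw: dict) -> SaleRecord:
        # Reject malformed records at the door, not years later in an analysis job.
        return SaleRecord(
            product_id=str(raw["product_id"]),
            customer_id=str(raw["customer_id"]),
            quantity=int(raw["quantity"]),
            unit_price=float(raw["unit_price"]),
            sold_on=date.fromisoformat(raw["sold_on"]),
        )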

Companies collect data in order to run at larger scales. So, with a few exceptions, sifting through that data is not exploratory. It’s an attempt to get a reliable snapshot of the world at many different moments.

There are exploratory tasks for some industries too, but these are relatively small in scope, and they are generally about searching for unexpected patterns. But this means that first, you need to know the set of expected patterns. That step is often skipped.

Mostly, data isn’t exotic, it isn’t random, and it shouldn’t be a surprise. If there are a dozen different representations for it when it is collected, that is a mistake. Too often we get obsessed with technology but forget about its purpose. That is always an expensive mistake.