Thursday, March 27, 2025

Reinventing the Wheel

An all too common phrase in programming is “Don’t reinvent the wheel”. However, I believe that the phrase is often used incorrectly.

First, ‘the wheel’ is a theoretical concept. It is the notion that if something is ‘round’, it will ‘roll’, and you can put it under other stuff in order to move it forward. Bike tires and car tires are wheels, but creating a new type of tire is not ‘reinventing the wheel’; it is just reimplementing one. You haven’t reinvented it unless you replace ‘round’ with something else, like a hexagon or a tank track.

In software, that means that if you write your own persistence solution so that you can avoid using an RDBMS, you have not reinvented persistence; you just went your own way for the implementation.

On the web, we do occasionally see really neat alternatives to ‘wheels’. Mostly, though, our vehicles and furniture stick with the familiar old round things that have been around for millennia.

The reason is simple. The concept of a wheel works really well. It is state-of-the-art for moving things. In a specific case, you may need to implement your own variation of it; maybe air in a rubber layer is inappropriate, but rarely do you need to revisit the theory. It is fine.

It is also good that other people are exploring alternatives, as some of them seem quite useful, but if you are asked to engineer something that moves right now, the theory of a wheel is probably the best place to start. Then, maybe, you can leverage an existing implementation like a castor to get yourself going.

All of this is the same for software. You are asked to put together a big system or product, so you break it down into small parts. If one of those parts is ‘persistence’, you first look at an RDBMS or NoSQL. You evaluate the properties of both and likely choose an implementation that honestly best fits your requirements.

You would not commit to doing your own implementation unless there is a particularly nasty requirement that is absolutely not met by any of the existing implementations.

But if you did decide to go your own way, first you would acquire the full knowledge of all of the theories, the state-of-the-art for persistence. You’d know it all, from the distributed fault tolerance perspectives all of the way over to data normalization.

You can’t skip parts because you’d have to reinvent them, and since most of them are decades old, you’d be falling right back to ground zero. You don’t want some piece to be crude if there is already an implementation out there that is not crude.

Given the requirement to know the state of the art, writing their own persistence solution is pretty much not an option for most people. They don’t have the knowledge to get their implementation up to the state-of-the-art, and landing far lower isn’t going to net them the benefits that they need. This is the core of “don’t reinvent the wheel”: you shouldn’t, unless you really can. And you really can only when you know a crazy amount of stuff and also have a crazy amount of time to get it implemented; both are rare circumstances. So, don’t do it.

But while persistence is a very deep and broad knowledge base, people frequently make the mistake of applying that logic to much simpler issues.

You could, for instance, reinvent or reimplement your own widget. Reinvention is trickier, as the widget will be perceived as eclectic by most users, so they probably won’t like it, but there are times when a brand new type of widget is actually what is needed. Rare, but possible. As for reimplementing, if you fix enough of the annoyances with most modern widget sets in order to lift up their quality, then it is probably very worthwhile. Or maybe you just wrap all of the other widgets to fix their inconsistencies, another variation on reimplementation that often makes sense, as the sketch below shows.
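As a minimal sketch of that wrapping variation, with purely hypothetical toolkits and method names:

```python
# A thin wrapper that irons out inconsistencies between widget sets,
# so the rest of the code only ever sees one uniform interface.
class Button:
    def __init__(self, backend, label):
        self.backend = backend  # the underlying toolkit's widget (hypothetical)
        self.label = label

    def on_click(self, handler):
        # Hypothetical inconsistency: one toolkit says 'bind', the
        # other says 'connect'. The wrapper hides the difference.
        if hasattr(self.backend, "bind"):
            self.backend.bind("click", handler)
        else:
            self.backend.connect("clicked", handler)
```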

This applies to all sorts of places in the tech stacks where the current implementations are weak or just glue over something else. You probably don’t need to reinvent them, but it may make your life considerably more comfortable to reimplement them, or at least wrap them up nicely. These two types of coding tasks are really good but, unfortunately, are often confused with reinvention. You’re not reinventing the wheel; you are just applying your knowledge to get a better implementation somehow. Of course, to do this successfully, you need that knowledge first. Without that, you are just going backward.

Friday, March 21, 2025

Herding Cats

The idiom about “herding cats” is frequently used in IT and plays out in many different places in the industry.

It aptly describes programmers. Because the industry keeps expanding far faster than people can follow, the culture tends to overcompensate. Programmers are frequently crazy overconfident in their beliefs, views, and opinions, as it oddly tends to help their careers. But it also means a high rate of failure and a lot of personality clashes.

I always figured that if we got the 10 greatest programmers on the planet all together on the same project, instead of producing magic, they would just dissolve into bickering with each other. The cats all want to go off in their own direction.

That tendency plays out in technology as well.

Microservices became a super popular idea. Instead of everyone coming together to build something, it offers the dream that everyone could just build little, fully independent pieces that would all magically work together. So, you don’t have to compromise for the team; you just do what you know is right, toss it into the mix, and it will be good.

But of course, it doesn’t work that way. If, at the binding level, the sum of the services is hugely disorganized and irrational, the whole thing just isn’t going to be useful or stable.

The health of the ‘forest’ dominates the health of the ‘trees’. Each cat tending to their own little tree without having to worry about the other trees only works if all of the trees are contained and arranged nicely. If the ‘tree’, however, is really various branches from a lot of other trees scattered all over the forest, it’s nuts. The parts need to be perfectly independent and totally aligned, which is a contradiction.

It shows up in products too.

Most, if not all, big products were evolutionary. A small group of people got it going, and then it was scaled way up over a very long time to reach its current level of maturity. So, it’s not uncommon that hundreds or even thousands of programmers were involved over decades.

That initial small group likely had a strong, tight focus, which is why the product didn’t fail. But as people come and go, that focus gets watered down, realigned, and bounced all over the place. That shows up in the product. It goes from being a tight, nicely arranged set of functionality to being a pile of mismatched features, erratically shoved into strange corners of the interface.

Before maturity, there wasn’t as much functionality, but you could always find it. After maturity, everything, including the kitchen sink, is there, but now it is tricky or impossible to find, and it behaves inconsistently with everything else.

We see it with in-house software as well. A friend of mine used to love to say that companies get the software they deserve. If you look at most big non-software companies, their IT departments are a whacking great mess of chaos, bugs, and defective code.

Since personality issues tend to be the downfall of large internal projects, most of the stuff is little bits of glue haphazardly applied to everything.

It isn’t even a ball of mud, but rather a giant field of wreckage. And the software industry loves this. They endlessly keep selling magic bullets to fix the ever-growing disaster, but generally just make it worse somehow. Generation after generation of technology makes bold claims and then disappoints.

So long as programmers want the freedom to go off and do their own thing, ignoring what they don’t want to do or don’t know, the results of their work will have strictly limited usefulness.

If it is fully eclectic, it fits with nothing. Yes, it is a creative work, but that creativity is what keeps it from being useful.

If it is fully standard, to whatever the larger context is, there is almost no freedom whatsoever. You figure out the standards, implement them correctly, and then put state-of-the-art pieces below. Almost no creativity, but a massive amount of prior knowledge. The thing is great and works with everything else, but it is as it should be, not as you’d like it to be.

The work is figuring out what it should be. This is why the programmers who can successfully work with the greatest programmers on the planet have more value than their peers. They are humble and know that creativity isn’t necessary. Research is a better use of their time. Get as close to standard as you can in the time you are allowed.

The broader theme is that ‘focus’ is a huge part of quality. If the work is large and unfocused, there will be plenty of problems with it, and if there are too many of them, they undo any or all of the value of the work itself. If, however, the work is tightly focused, then most of the contributors must orient their work in that direction, which takes a lot of the fun out of it. More focus means less freedom, but also better results. There does not seem to be any way around this.

Friday, March 14, 2025

Development Automation

One popular programming fallacy is that it is not worth spending five hours writing a script that runs in a few minutes.

Mostly, this perspective seems to come from people who don’t like scripting. They want to have just one primary coding language and to stay working in that language all of the time. Scripting usually requires at least one other language and lots of little tools. It is different from the other coding tasks. It’s more stuff to learn.

This perspective is wrong for a bunch of reasons.

First, if you spend five hours writing a script and then run it thousands of times, it obviously pays off. Five hours is 300 minutes; if the script replaces even a three-minute manual task, it breaks even after a hundred runs, and everything beyond that is profit. It still pays off if you run it hundreds of times, and probably even if you run it only a dozen.

And if you’ve scripted all sorts of stuff, even if a few of those scripts don’t get run very often, the more you script, the faster you will get at doing it. Practice makes perfect, but it also makes you faster.

If you’ve mostly scripted everything, then creating new variations on other scripts is nearly trivial. You have lots of examples to work off of. You don’t want a billion brute force scripts, but most of the things you are automating are essentially similar. Sometimes you refactor to expand scope (preferred), but other times you just copy and modify to get it going.

And whenever you run the script, the results will be deterministic. A good script will run the same way every time. You have at least semi-automated the task, if not fully automated it. That’s what we do for users; why not for ourselves?

If, instead of scripting, you need to click on 10 buttons each time, the chances are that you’ll miss some steps. If you need to type 10 long CLI commands, it is even worse. More chances to forget stuff.

The cost of getting it wrong is always way more expensive than the automation costs. You do a PROD install and forget a tiny piece; it isn’t just the time you have blown, it is also the loss of trust from the people around you. Oddly, that often results in more meetings and more tracking. A history of screwups is very costly.

And, of course, documentation. If someone new starts, you point them to the scripts. If you step away for a while, you point yourself to them. They are a record of the things you are doing every day. You can’t really share them with non-technical people, but that is fine, even better.

Finally, encapsulation. If you bury the ugly grunge into a script, you can forget about it. Pretend it never happened. You’ve stripped off some small annoying piece of complexity and replaced it with a simple command. If you keep that up, there is far less friction and far more accuracy, which gives you far more time to concentrate on the newer, bigger, far more interesting problems.
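As a minimal sketch of that kind of encapsulation (the steps and names here are illustrative, not a real pipeline), a whole release’s worth of grunge can collapse into one command:

```python
#!/usr/bin/env python3
"""release.py -- runs the same steps, the same way, every time."""
import subprocess
import sys

# The hidden grunge, written down exactly once.
STEPS = [
    ["git", "pull", "--ff-only"],
    ["python", "-m", "pytest", "-q"],
    ["python", "-m", "build"],
]

def main():
    for cmd in STEPS:
        print("->", " ".join(cmd))
        if subprocess.run(cmd).returncode != 0:
            sys.exit(f"step failed: {' '.join(cmd)}")
    print("release complete")

if __name__ == "__main__":
    main()
```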

Scripting everything, or at least as much as you can, is and has always been a super strong habit. Part of mastering coding.

Ironically, while the main languages have changed radically over the decades, scripting has not. You can script well with early 90s techniques, and it will be as useful today as it was then.

Friday, March 7, 2025

Concurrency

For large complex code, particularly when it is highly configurable, polymorphic, or has dependency injection, it is sometimes difficult to get a clear and correct mental model of how it actually runs. The code might bounce around in unexpected ways. Complicated flow control is the source of a lot of bugs.

Now take that, but interlace multiple instances of it, each interacting with the same underlying variables, and know that it can get far worse.

The problem isn’t the instructions; it is any and all of the variables that they touch.

If there is a simple piece of code that just sets two variables, it is possible that one instance sets the first one, but then is leapfrogged by another instance setting both. After the first instance finally sets the last variable, the state is corrupt. The value of the variables is a mix between the two concurrent processes. If the intent of the instructions was to make sure both are consistent, it didn’t work.
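A tiny Python sketch of that interleaving (the loop counts are arbitrary; how many inconsistencies you actually observe will vary by machine and scheduler):

```python
import threading

state = {"a": 0, "b": 0}  # the intent: a and b are always equal

def writer(n, iterations=100_000):
    for _ in range(iterations):
        state["a"] = n
        # Another thread can leapfrog in right here, between the two writes.
        state["b"] = n

def checker(iterations=100_000):
    mismatches = 0
    for _ in range(iterations):
        a, b = state["a"], state["b"]
        if a != b:
            mismatches += 1
    print(f"observed {mismatches} inconsistent snapshots")

threads = [threading.Thread(target=writer, args=(1,)),
           threading.Thread(target=writer, args=(2,)),
           threading.Thread(target=checker)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```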

If it were just an issue within a single function call, it could be managed by adding a critical section around the two assignments, but more often than not, a function call is the root of a tree of execution that can include all sorts of other function calls with similar issues.
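For the simple two-assignment case, though, the critical section might look like this minimal sketch; note that readers have to take the same lock, or the problem quietly returns:

```python
import threading

state = {"a": 0, "b": 0}
lock = threading.Lock()

def set_both(n):
    with lock:  # no thread can observe a half-finished update
        state["a"] = n
        state["b"] = n

def read_both():
    with lock:  # reads need the same protection as writes
        return state["a"], state["b"]
```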

A simple way around this is to add big global locks around stuff, but that effectively serializes the execution, thus defeating most of the benefits of concurrency. You just added a boatload of extra complexity for nothing.

Just reading variables doesn’t help either, if they can be changed elsewhere. A second, writing thread could be half completed during the read, which is the same problem as before.

You can make it all immutable, but you still have to be very concerned with the lifetime of the data. Where it is loaded or deleted from memory can get corrupted too. You can only ignore thread safety when everything is preloaded first, strictly immutable, and never deleted.

Most days, concurrency is not worth the effort. You need it for multi-use code like a web server, but you also want to ensure the scope of any variable is tightly bound to the thread of execution. That is, at the start of the thread, you create everything you need, and it is never shared. Any global you need is strictly immutable and never changes or gets deleted. Then you can ignore it all. Locking is too easy to get wrong.
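A minimal sketch of that thread-bound scoping, with made-up work inside:

```python
import threading

def handle_request(request):
    # Everything this thread needs is created here, at the start,
    # and never shared with any other thread.
    local_results = []
    for item in request:
        local_results.append(item * 2)  # stand-in for real work
    print(local_results)

# Each thread gets its own private state; nothing to lock.
threads = [threading.Thread(target=handle_request, args=([1, 2, 3],)),
           threading.Thread(target=handle_request, args=([4, 5, 6],))]
for t in threads:
    t.start()
for t in threads:
    t.join()
```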

Some neat language primitives like async seem to offer simple-to-use features, but if you don’t understand their limits, the cost is Heisenbugs: strange concurrency corruptions that are so low in frequency that people confuse them with random hiccups. They might occur once per year, for example, so they are effectively impossible to replicate; they tend to stay around forever and agitate the users.

If you aren’t sure, then serial execution is best. At least for the start. If a language offers features to catch absolutely every possible concurrent issue, that is good too, but one that only catches ‘most’ of them is useless because ‘some’ of them are still out there.

Most concurrent optimizations are only ever micro-optimizations. The performance gains do not justify the risks involved. It is always far better to focus on normalizing the data and being frugal with the execution steps, as those can often net huge macro-optimizations.

The hard part about concurrency is not learning all of the primitives but rather having a correct mental model of how they all interoperate so that the code is stable.

Friday, February 28, 2025

The Digital Realm

There is the real world. The physical one has been around for as long as humans can remember.

Then there is the digital world, which is an artificially constructed realm based on top of millions, possibly billions or even trillions, of interconnected computers.

Hardware always forms the sub-structure. The foundation. It is what binds the digital realm to reality.

What’s above that is just data and code. Nothing else.

Anything else that can be imagined there is either data, code, or a combination of the two.

Data is static. It just exists as it is. You can really only change it by writing some other data on top of it, wiping the original copy out of existence.

Code is active. It is a list of instructions, often crazy long, sometimes broken up into countless pieces spread across all sorts of places.

Code ‘runs’. Something marches through it, effectively instruction by instruction, executing it, in more or less a deterministic fashion.

Code is data long before it is code. That is because it is a ‘list’ of instructions; when it is not running it is just a list of things. It is data when inactive.

Data can effectively be code. You can declare a whack load of data that is interpreted as ‘high-level’ code to trigger very broad instruction sets.
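A tiny sketch of that blurring, where a pipeline declared purely as data gets interpreted into behavior (the steps are arbitrary string methods, chosen only for illustration):

```python
# The pipeline is just data...
pipeline = [
    ("strip",),
    ("lower",),
    ("replace", " ", "_"),
]

# ...until a small interpreter marches through it like code.
def run(value, steps):
    for name, *args in steps:
        value = getattr(value, name)(*args)
    return value

print(run("  Hello World  ", pipeline))  # -> hello_world
```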

Data is not just bits and bytes. It is not just single pieces of information encoded in some fashion. Most data only has value if it is used in conjunction with related data. Those groups have structure, whether it is a collection of individual data points or a list of stuff. There are higher-level structural relationships too, like DAGs, trees, graphs, and hypergraphs. Mostly, but not always, the individual parts and their various structures have some names associated with them. Metadata, really. Information about how all the individual points relate back to each other. Data about the structure of the underlying data.

In its simplest sense, data corresponds to the way we use nouns in language; code corresponds to verbs. We blur the lines for some sophisticated usage, but most forms of programming tend towards keeping them separate and distinct.

We know we need to secure data. It is the heart and soul of the information we are collecting with our computers. That information could be used by malicious people for bad ends. But we also need to secure code. Not just when it is data but also as it executes. As they are distinct, one means of securing them will never cover both; they are, in effect, two different dimensions. Thus, we need two different and distinct security models, each of which covers its underlying resource. They won’t look similar; they cannot be blended into one.

Saturday, February 22, 2025

Mastery

Oddly, mastering programming is not about being able to spew out massive sets of instructions quickly. It’s about managing cognitive load. Let me explain.

Essentially, we have a finite amount of cognitive horsepower, most of which we tend to spend on our professions, but life can be quite demanding too.

So that’s the hard limit to our ability to work.

If you burn through that by memorizing a whack load of convoluted tidbits while coding, it will consume most of your energy. I often refer to this as ‘friction’.

So, to make the best possible progress on your work, you need to reduce as much of that friction as you can. You don’t want to spend your time thinking about ‘little’ things. Instead, you want to use it diving deeply into the substantial problems. The big ones that will keep you from moving forward directly.

In that sense, less is more. Far more.

For example, I’ve known for a long time that a million little bits of functionality is not all that helpful. It’s not a coherent solution.

It’s better to generalize it a little, then pack it all together and encapsulate it into larger ‘lego’ bricks. Now you have fewer things that you can call, but they do a far wider range of tasks. You still need to supply some type of configuration knowledge to them, but that too can be nicely packaged.
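As a minimal sketch of such a brick (the file formats and names are just assumptions for illustration), a pile of near-identical loaders can collapse into one call plus packaged configuration:

```python
import csv
import json
from pathlib import Path

# The format-specific knowledge, packaged as configuration.
LOADERS = {
    ".csv": lambda f: list(csv.DictReader(f)),
    ".json": json.load,
}

def load_records(path):
    """One generalized brick: the file extension selects the parser."""
    path = Path(path)
    with open(path, newline="") as f:
        return LOADERS[path.suffix](f)
```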

This fits with what I said initially about cognitive load. You don’t have to remember a million little functions. You don’t have to keep rewriting them. Once you have solved a problem, big or small, you package it up and lean on that solution later. So now you can forget about it. It’s done, and it’s easily findable. Once you get a bunch of those at a low level, you can build higher-level ones on top.

Since you rarely need to work at that previous level now, you have far less to think about. It is a solved problem. As you go, you get higher and higher, the friction is less, and the work you are doing is far more sophisticated.

It was known long ago that good projects get easier to work on with time; bad ones do not. Easier because there is less friction, and you can do larger things quickly with better accuracy. You’ve got this codebase that solves the little problems, so you can now work on bigger ones. That’s why we have system libraries for languages, for instance.

If you’ve got some good pieces and someone asks for something complicated, it is not that hard to extend them. But it is really hard to go right back to ground zero and do it all over again. Your brain has to cope with lots of levels now; it will get overwhelmed. But if you can skip over all of the things already solved, then you can focus on the good stuff. The new stuff. The really hard stuff.

In a similar way, disorganization and inconsistencies eat through massive amounts of cognitive load. It is inordinately harder to work in a messy environment than a neat and tidy one. If all your tools are neatly lined up and ready when you need them, then jumping around is fluid. If you have to struggle through a mess just to find something, it bogs you down. Struggling through the mess is what you're doing, not solving the problems you need to solve.

So you learn that the dev shop and its environment need to be kept as clean as time allows. And tidying up your working environment is almost always worth the time. Not because you are a neat freak, but because of how the friction will tire you out.

If you manage your cognitive load really well then you don’t have to spend it on friction. You can spend it on valuable things like understanding how the tech below you really works. Or what solutions will really help people with their problems.

The less time you spend on things like bizarre names, strange files, and cryptic configurations, the more you have to spend on these deeper things, which helps you see straighter and more accurate paths to better solutions. In that sense, the ‘works on my machine’ excuse from someone who exhausted themselves drowning in a mess is really just a symptom of losing control over the tornado of complexity that surrounds them.

Thursday, February 13, 2025

Control

I’ve often written about the importance of reusing code, but I fear that the notion in our industry has drifted far away from what I mean.

As far as time goes, the worst thing you can do as a programmer is write very similar code, over and over and over again. We’ve always referred to that as ‘brute force’. You sit at the keyboard and pound out very specific code with slight modifications. It’s a waste of time.

We don’t want to do that because it is an extreme work multiplier. If you have a bunch of similar problems, it saves orders of magnitude of time to just write it once a little generally, then leverage it for everything else.

But somehow the modern version of that notion is that instead of writing any significant code, you just pile up as many libraries, frameworks, and products as you can. The idea is that you don’t write stuff; you just glue it together for construction speed. The corollary is that stuff written by other people is better than the stuff you’ll write.

The flaw in that approach is ‘control’. If you don’t control the code, then when there is a problem with that code, your life will become a nightmare. Your ‘dependencies’ may be buggy. Those bugs will always trigger at the moment you don’t have time to deal with them. With no control, there is little you can do about some low-level bug except find a bad patch for it. If you get enough bad patches, the whole thing is unstable, and will eventually collapse.

You get caught in a bad cycle of wasting all of your time on things you can’t do anything about, so you don’t have the time anymore to break out of the cycle. It just sucks you down and down and down.

The other problem is that the dependencies may go rogue. You picked them for a subset of what they do, but their developers might really want to do something else. They drift away from you, so your glue gets uglier and uglier. Once that starts, it never gets better.

In software, the ‘things’ you don’t control will always come back to haunt you. Which is why we want to control as much as possible.

So, reusing your own stuff is great, but reusing other people’s stuff has severe inherent risks.

The best way to deal with this is to write your own version of whatever you can, given the time available. That is, throwing in a trivial library just because it exists is bad. You can look at how they implemented it, and then do your own version which is better and fits properly into your codebase. In that sense, it's nice that these libraries exist, but it is far safer to use them as examples for learning than to wire them up into your code.

There are some underlying components, however, that are super hard to get correct. Pretty much anything that deals with persistence falls into this category, as it requires a great deal of knowledge about transactional integrity to make the mechanics fault-tolerant. If you do it wrong, you get random bugs popping up all over the place. You can’t fix a super rare bug simply because you cannot replicate it, so you’d never have any certainty that your code changes did what you needed them to do. Where there is one Heisenbug, there are usually lots more lurking about.

You could learn all about low-level systems programming, fault tolerance, and such, but you probably don’t have the decade available to do that right now, so you really do want to use someone else’s code for this. You want to leverage their deep knowledge and get something nearly state-of-the-art.
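Not something prescribed here, but one common way to keep a measure of control while leaning on that borrowed depth is to put it behind a thin seam of your own. A minimal sketch using Python’s built-in sqlite3, with a purely illustrative schema:

```python
import sqlite3

# The rest of the codebase calls these functions, never sqlite3
# directly, so the engine can be patched or swapped in one place.

def connect(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS events (ts TEXT, body TEXT)")
    return conn

def save_event(conn, ts, body):
    with conn:  # one transaction per write
        conn.execute("INSERT INTO events VALUES (?, ?)", (ts, body))

def load_events(conn):
    return conn.execute("SELECT ts, body FROM events ORDER BY ts").fetchall()
```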

But that is where things get complicated again. People seem to think that ‘newer’ is always better. Coding seems to come in waves, so sometimes the newer technologies are real actual improvements on the older stuff. The authors understood the state of the art and improved upon it. But only sometimes.

Sometimes the authors ignore what is out there, have no idea what the state of the art really is, and just go all the way back to first principles to make every old mistake again. And again. There might be some slight terminology differences that seem more modern, but the underlying work is crude and will take decades to mature, if it ever does. You really don’t want to be building on anything like that. It is unstable, and everything you put on top will be unstable too. Bad technology never gets better.

So, you sometimes need to add stuff you can’t control, and that is inherently hazardous.

If you pick something trendy that is also flakey, you’ll just suffer a lot of unnecessary problems. You need to pick the last good thing, not the most recent one.

That is always a tough choice, but crucial to building stable stuff. As a consequence, though, it is important to admit when the choice was bad, when you picked a dud. Admit it early, since it is usually cheaper to swap it for something else as early as possible.

Bad dependencies are time sinks. If you don’t control one and can’t fix it when it breaks, then at the very least you need it to be trustworthy. That means it is reliable and relatively straightforward to use. You never need a lot of features, and in most cases, you shouldn’t need a lot of configuration either. Just stuff that does exactly what it is supposed to do, all of the time. You want it to encapsulate all of the ugliness away from you, but you also want it to deal with that ugliness correctly, not just ignore it.

If you are picking great stuff to build on, then you get more time to spend building your own stuff, and if you aren’t just retyping similar code over and over again, you can spend this time keeping your work organized and digging deeply into the problems you face. You are in control. That makes coding a whole lot more enjoyable than just rushing through splatting out endless frail code. After all, programming is about problem-solving, and we want to keep solving unique high-quality problems, not redundantly trivial and annoying ones. Your codebase should build on your knowledge and understanding. That is how you master the art.

Tuesday, February 4, 2025

Integrated Documentation

Long ago, we built some very complex software.

We had a separate markdown wiki to contain all of the necessary documentation.

Over time, the main repo survived, but that wiki didn’t. All of the documentation was disconnected and thus was lost.

When I returned to the project years later, it was still in active use; they needed it, but the missing documentation was causing chaos. They had shot themselves in the foot.

Since then, I have put the documentation inside the repo with the rest of the source code. Keeping track of one thing in a large organization is difficult enough; trying to keep two different things in sync is impossible.

By now, we should be moving closer to literate programming: https://en.wikipedia.org/wiki/Literate_programming

Code without documentation is just a ball of mud. Code with documentation is a solution that hopefully solves somebody’s problems. Any nontrivial lump of code is complicated enough that it needs extra information to make it usable.

For repo hosting sites like GitHub and GitLab, if they offer some type of wiki for documentation, that wiki should be placed in the main project repo as a subdirectory. The markdown files are effectively source files. Any included files are effectively source files. They need to get versioned like everything else.

There has always been this misunderstanding that source files must be ‘text’ and, for the most part, ‘code’. That is incorrect. The files are the ‘source’ of the data, information, binaries, etc. It was common practice to put binary library files into the source repo, for example, when they had been delivered to the project from outside sources. Keeping everything together with a long and valid history is important. The only thing that should not be in a repo is secrets, as they should remain secret.

Otherwise, the repo should contain everything. If something has to be pulled from other sites, it should be at very explicit versions; it should not be left to chance. If you go back to a historical version, it should be an accurate historical version, not a random mix of history.
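In a Python repo, for instance, that might be as simple as pinning exact versions (the packages and versions below are only illustrative):

```
# requirements.txt -- everything pulled from outside is an explicit version
requests==2.31.0
pyyaml==6.0.1
```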

A fully documented self-standing repo is a thing of beauty. A half-baked repo is not. We keep history to reduce friction and make our lives easier. It is a little bit of work, but worth it.

Friday, January 24, 2025

Self Describing

One of the greatest problems with software is that it can easily be disconnected.

There is a lot of code for some projects or functionality, but people can’t figure out what it was trying to do. It’s just a big jumble of nearly random instructions.

The original programmers may have mostly understood what it does and how it works, but they may not have been able to communicate that information to everyone who may be interested in leveraging their work.

A big problem is cryptic naming.

The programmers pick acronyms or short versions for their names instead of spelling out the full words. Some acronyms are well-known or mostly obvious, but most are eclectic and vague. They mean something to the programmers, but not to anyone else. A name that only a few people understand is not a good name.

The notion that spelling everything out to be readable is a waste of time is an unfortunate myth of the industry. Even if it saves you a few minutes of typing, it is likely to eat hours or days of somebody else’s time.
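A small, hypothetical before-and-after in Python:

```python
# Cryptic: meaningful to the original author, vague to everyone else.
def chk_tto(c, amt):
    ...

# Spelled out: a few minutes of typing that saves hours of reading.
def check_total_order(customer, amount):
    ...
```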

Another problem is the misuse of terminology.

There may have been a long-established meaning for some things, but the programmers weren’t fully aware of those definitions. Instead, they use the same words, but with a slight or significant change in meaning. Basically, they are using the words wrong. Anyone with a history in the field will be confused or annoyed by the inappropriate usage, and it will lead other people astray.

Some programming cultures went the other way.

They end up spelling everything out in full, excessive detail, and it is the excess length of the names that tends to make them easily misunderstood. They throw up a wall of stuff that obscures the parts underneath. We don’t need huge, extensive essays on how the code works, but we do need some extra information besides the code itself. Finding that balance is part of mastering programming.

Stuttering is a common symptom of severe naming problems. You’ll see parent/child relationships that have the exact same names. You never need to repeat the same string twice, but it has become rather too common to see that in code or file systems. For some technologies, it is too easy to stutter, but it's always a red flag that indicates that people didn’t take the time to avoid it. It makes you wonder what other shortcuts they took as well.
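A small Python illustration of the stutter (the class and fields are made up):

```python
# Stuttering: the context already names it, then the children repeat it.
class Account:
    account_id = 0
    account_balance = 0.0  # reads as Account.account_balance

# No stutter: the context carries its share of the meaning.
class Account:
    id = 0
    balance = 0.0  # reads as Account.balance
```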

Ultimately a self-describing name is one that gives all of the necessary information that a qualified person needs to get an understanding or to utilize something. There is always a target audience, but it is usually far larger than most programmers are willing to admit.

If you put your code in front of another programmer and they don’t get it, or they make very invalid assumptions about what it does, it is likely a naming problem. You can’t get help from others if they don’t understand what you are trying to do, and due to its complexity, serious programming has evolved into needing teams of people to work on it rather than just individuals.

Modern-day programming is slogging through a rather ugly mess of weird syntax, inconsistencies, awkwardness, confusion, and bugs galore. People used to take the time to make sure their work was clean and consistent, but now most of it is just ugly and half-baked, an annoyance to use. Wherever possible, we should try to avoid creating or using bad technologies, they do not make the world a better place.

Friday, January 17, 2025

Complexity

Often, when people encounter intense complexity in their path, they choose to believe that an oversimplification, rather than the truth, is a better choice.

In software terms, they are asked to provide a solution for a really hard, complex problem. The type of code that would choke most programmers. Instead, they avoid it and provide something that sort of, kind of, works; something close to what was needed, but never really suitable.

That’s a misfitting solution. It spawns all sorts of other problems as an explosion of unaddressed fragments. So they go into firefighting mode, trying to put out all of these secondary fires, but it only gets worse.

The mess and the attempt to bring it back under control can take far more time and effort than if they had just tackled the real problem. These types of “shortcuts” are usually far longer. They become a black hole sucking in all sorts of other stuff, spinning wildly out of control. Sometimes they never, ever, really work properly. The world is littered with such systems. Half-baked road hazards, just getting in people’s way.

Now it may be that the real solution would cross some sort of impenetrable boundary and be truly impossible, but more often it just takes a long time, a lot of concentration, and needs an abstraction or two to anchor it. If you carefully look at it from the right angle it is very tractable. You just have to spend the time looking for that viewpoint.

But instead of stepping back to think, people dive in. They just apply brute force, in obvious ways, trying to pound it all into working.

If it's a lot of complexity and you try to outrun it, you’ll end up with so much poor code that you’ll never really get any of it to work properly. If instead, you try to ignore it, it will return to haunt you in all sorts of other ways.

You need to understand it, then step up a level or two to find some higher ground that covers and encapsulates it. That code is abstract but workable.

It will be tough to implement, but as you see the bugs manifest, you rework the abstraction to correct the behavior. You’ll get far fewer bugs, but they will be far harder to solve. You can’t just toss on bandaids, they require deep refactoring each time. Still, once solved they won’t return or cascade, which ultimately makes it all a whole lot easier. It’s slower in the beginning but pays off.

The complexity of a solution has to match the complexity of the problem it is trying to solve. There is no easy way around this, you can’t just cheat and hope it works out. It won’t. It never has.

Saturday, January 4, 2025

Data Collection

There are lots of technologies available that will help companies avoid spending time organizing their data. They let them just dump it all together, then pick it apart later.

Mostly, that hasn’t worked very well. Either the mess renders the data lost in the swamp, or the resource usage is far too extreme to offset the costs.

But it also isn’t necessary.

The data that a company acquires is data that it specifically intends to collect. It’s about their products and services, customers, internal processes, etc. It isn’t random data that randomly appears.

Mining that data for knowledge might, at very low probabilities, offer some surprises, but likely the front line of the business already knows these things, even if that knowledge isn’t being communicated well.

Companies also know the ‘structure’ of the data in the wild. It might change periodically, or be ignored for a while, but direct observations can usually describe it accurately. Strong analysis saves time.

Companies collect data in order to run at larger scales. So, with a few exceptions, sifting through that data is not exploratory. It’s an attempt to get a reliable snapshot of the world at many different moments.

There are exploratory tasks for some industries too, but these are relatively small in scope, and they are generally about searching for unexpected patterns. But this means that first, you need to know the set of expected patterns. That step is often skipped.

Mostly, data isn’t exotic, it isn’t random, and it shouldn’t be a surprise. If there are a dozen different representations for it when it is collected, that is a mistake. Too often we get obsessed about technology but forget about its purpose. That is always an expensive mistake.
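As a minimal sketch of collecting into one representation (the incoming formats are illustrative assumptions), dates arriving in several shapes get normalized at the door:

```python
from datetime import datetime

# Normalize at the point of collection, not later in the swamp.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y"]

def normalize_date(raw):
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    # A new, thirteenth representation should be a flagged mistake, not a surprise.
    raise ValueError(f"unexpected date representation: {raw!r}")

print(normalize_date("03/11/2024"))  # -> 2024-11-03
```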