Thursday, September 14, 2023

Modelling Complex Data

Usually, the point of building a software system is to collect data. The system is only as good as the data that it persists for a long time.

Machines go up and down, so data is really only persisted when it is stored on a long-term device like a hard disk.

What you have in memory is transitory. It may stay around for a while, it may not. A system that collects stuff, but accidentally forgets about it sometimes, is useless. You cannot trust it, and trust is a fundamental requirement for every piece of software, large and small.

But even if you manage to store a massive amount of data, if it is a chaotic mess it is also useless. You store data so you can use it later; if that isn’t possible, then you haven’t really stored it.

So, it is extremely important that the data you store is organized. Organization is the means of retrieving it.

A long, long time ago everybody rolled their own persistence. It was a disaster. Then relational databases were discovered and they dominated. They work incredibly well, but they are somewhat awkward to use and you need to learn a lot of stuff in order to use them properly. Still, they gave us decades of reliability.

NoSQL came along as an alternative, but to get the most out of the tech people still had to understand concepts like relational algebra and normalization. They didn’t want to, so things returned to the bad old days where people effectively rolled their own messes.

The problem isn’t the technology, it is the fact that data needs to be organized to be useful. Some new shiny tech that promises to make that issue go away is lying to you. You can’t just toss the data somewhere and figure it out later. All of those promises over the decades ended in tears.

Realistically, you cannot avoid having at least one person on every team that understands how to model persistent data. More is obviously better. Like most things in IT, from the outside, it may seem simple but it is steeped in difficulties.

The first fundamental point is that any sort of redundant data is bad. Computers are stupid and merging is mostly unreliable, so it’s not about saving disk space, but rather the integrity, aka quality, of the data. The best systems store everything only once; then the code is simpler and the overall quality of the system is always higher.
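
To make that concrete, here is a minimal sketch using Python’s built-in sqlite3 module and a hypothetical customers/orders schema (the table and column names are made up for illustration). The customer’s name lives in exactly one row, so a correction happens in one place and every order picks it up automatically, with no duplicate copies to merge.

```python
import sqlite3

# Each fact lives in exactly one place: the customer's name is
# stored once in `customers`; orders reference it by key instead
# of carrying around their own copy.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total       REAL NOT NULL
    )
""")

conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada Lovelace')")
conn.execute("INSERT INTO orders (customer_id, total) VALUES (1, 19.99)")
conn.execute("INSERT INTO orders (customer_id, total) VALUES (1, 42.50)")

# Correcting the name touches one row; every order sees the fix.
conn.execute("UPDATE customers SET name = 'Ada King' WHERE id = 1")

for row in conn.execute("""
    SELECT o.id, c.name, o.total
    FROM orders o JOIN customers c ON c.id = o.customer_id
"""):
    print(row)
```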

The second fundamental issue is that you want to utilize the capabilities of the computer to keep you from storing garbage. That is, the tighter your model matches the real world, the less likely it is to be choked with garbage.
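
With a relational database, one way to do that is with declared constraints. Below is a minimal sketch, again using sqlite3 with a made-up employees table: the NOT NULL, UNIQUE, and CHECK declarations encode what a valid row looks like, so the engine itself rejects garbage instead of persisting it.

```python
import sqlite3

# The tighter the declared model, the more garbage the database
# itself refuses to store.
conn = sqlite3.connect(":memory:")

conn.execute("""
    CREATE TABLE employees (
        id        INTEGER PRIMARY KEY,
        email     TEXT NOT NULL UNIQUE,
        hire_date TEXT NOT NULL CHECK (
            hire_date GLOB '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]'
        ),
        salary    REAL NOT NULL CHECK (salary > 0)
    )
""")

# A well-formed row is accepted.
conn.execute(
    "INSERT INTO employees (email, hire_date, salary) VALUES (?, ?, ?)",
    ("ada@example.com", "2023-09-14", 90000.0),
)

# Garbage is rejected at the door rather than persisted.
try:
    conn.execute(
        "INSERT INTO employees (email, hire_date, salary) VALUES (?, ?, ?)",
        ("bob@example.com", "not a date", -5.0),
    )
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```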

The problem is that this means pedantically figuring out the breadth and depth of each and every thing you need to store. It is a huge part of analysis, a specialized skill all on its own. Most people are too impatient to do this, so they end up paying the price.

To figure out a model that fits correctly to the problem domain means actually having to understand a lot of the problem domain. Many programmers are already so overwhelmed by the technological issues that they don’t want to poke into the domain ones too. Unfortunately, you have no choice. Coders code what they know, and if they are clueless as to what the users are doing, their code will reflect that. But also, coders with domain expertise are far more valuable than generic coders, so there's a huge career upside to learning what the users are doing with their software.

If you avoid redundant data and you utilize the underlying technology to its best abilities to ensure that the data you need fits tightly, then you have a strong foundation to build on top of.

If you don’t have this, then all of those little modeling flaws will percolate through the code, which causes it to converge rapidly on spaghetti. That is, the best code in the world will still be awful if the underlying persisted data is awful. It will be awful because either it lets the bad data through, or it goes to insane lengths to not let the bad data through. You lose either way. A crumbly foundation is an immediate failure out of the gate.

Time spent modeling the data ends up saving a lot of time that would otherwise be wasted on hacking away at questionable fixes in the code. The code is better.
