Thursday, September 21, 2023

Historic Artifacts with Data

Software development communities carry a lot of weird historical noise when it comes to data. I guess we’ve been trying so hard to ignore it that we’ve made a total mess of it.

So, let’s try again:

A datum is a fixed set of bits. The size is set; it is always fixed, it does not change. We’ll call this a primitive datum.
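As a concrete illustration, here is a minimal sketch in Rust (the language choice is mine, just for the examples in this post): each primitive type is exactly that, a fixed set of bits whose size is known up front and never changes.

```rust
fn main() {
    // Each primitive type is a fixed set of bits; the size never changes.
    println!("u8:   {} bits", 8 * std::mem::size_of::<u8>());   // 8
    println!("i32:  {} bits", 8 * std::mem::size_of::<i32>());  // 32
    println!("f64:  {} bits", 8 * std::mem::size_of::<f64>());  // 64
    println!("char: {} bits", 8 * std::mem::size_of::<char>()); // 32, a Unicode scalar
}
```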

We often need collections of datums. In the early days, the minimum and maximum sizes of a collection were fixed. Later we accepted that the size could be huge, but keep in mind that it is never infinite. There is always a fixed limit; it just might not be easily computable in advance.
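A small sketch of that shift, contrasting a fixed-size array with a growable vector; note that even the growable one is bounded, just not by a limit you compute in advance.

```rust
fn main() {
    // The early style: the collection's size is fixed up front.
    let fixed: [i32; 4] = [1, 2, 3, 4];

    // The later style: the collection can grow, but it is still never
    // infinite; available memory and the allocator bound it.
    let mut growable: Vec<i32> = Vec::new();
    growable.push(5);
    growable.push(6);

    println!("fixed: {} items, growable: {} items", fixed.len(), growable.len());
}
```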

Collections can have dimensions and they can have structure. A matrix has two dimensions; a tree has structure.
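A quick sketch of the difference, with a two-dimensional matrix and a tree whose elements are related by parent/child links (the type names are just for illustration):

```rust
// Dimensions: the same element type, indexed along two axes.
type Matrix = Vec<Vec<f64>>;

// Structure: elements related to each other by parent/child links.
struct Tree {
    value: i32,
    children: Vec<Tree>,
}

fn main() {
    let m: Matrix = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    let t = Tree { value: 1, children: vec![Tree { value: 2, children: vec![] }] };
    println!("matrix rows: {}, root value: {}, children: {}", m.len(), t.value, t.children.len());
}
```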

All of the data we can collect in our physical universe fits into this. It can be a single value, a list or set of values, a tree, a directed acyclic graph, a full graph, or possibly even a hypergraph. That covers it all.

I suppose some data out there could need a 14-dimensional hypergraph to correctly represent it. I’m not sure what that would look like, and I’m probably not going to encounter that data while doing application programming.

Some of the confusion comes from things that we’ve faked. Strings are a great example of this. If you were paying attention, a character is a primitive datum. A string is an arbitrary list of characters, and that list is strictly ordered. The size of a string is mostly variable, but there are plenty of places and technologies where you need to set a maximum.
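A sketch of that view, treating a string as the ordered collection it really is; MAX_LEN is a hypothetical cap, standing in for something like a database column width.

```rust
fn main() {
    // A string is an ordered list of characters, not a primitive.
    let s = String::from("hello");
    let chars: Vec<char> = s.chars().collect(); // the underlying collection
    assert_eq!(chars.len(), 5);

    // Some contexts force a fixed maximum; this one is hypothetical.
    const MAX_LEN: usize = 255;
    assert!(chars.len() <= MAX_LEN);
    println!("{} characters, within the cap of {}", chars.len(), MAX_LEN);
}
```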

So, a string is just a container full of characters. Putting double quotes around it is a syntactic trick: a special character denotes the start of the container, and the same special character denotes the end. Denoting the start and end changes the state of the interpretation of those characters. That is, you need a way of knowing that a bunch of sequential datums should stay together. You could put in some sort of type identifier and a size, you could use a separator and an implicit end (usually something like EOF or EOS), or you can just mark out the start and end, as we commonly see with strings.
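Here is a sketch of those three options side by side: a type identifier plus a size, a separator with an implicit end (the C-style NUL), and matching start and end markers. The tag byte and the one-byte length are my own simplifications.

```rust
// 1. A type identifier and a size (length-prefixed).
fn length_prefixed(s: &str) -> Vec<u8> {
    let mut out = vec![b'S', s.len() as u8]; // tag + size; assumes len <= 255
    out.extend_from_slice(s.as_bytes());
    out
}

// 2. A separator and an implicit end (here a NUL terminator, like C strings).
fn terminated(s: &str) -> Vec<u8> {
    let mut out = s.as_bytes().to_vec();
    out.push(0); // the end is implied by the marker
    out
}

// 3. Mark out the start and the end with the same special character.
fn quoted(s: &str) -> String {
    format!("\"{}\"", s)
}

fn main() {
    println!("{:?}", length_prefixed("hi")); // [83, 2, 104, 105]
    println!("{:?}", terminated("hi"));      // [104, 105, 0]
    println!("{}", quoted("hi"));            // "hi"
}
```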

Any which way, you add structure on top of a sequence of characters, but people incorrectly think the string is itself a primitive datum. It is not. It is actually a secret collection.

The structural markers embedded in the data are data themselves. Given a data format, there can be a lot of them. They are effectively meta-data that tells you how to collect together and identify the intervening data. They can be ambiguous, noisy, unbalanced, and have a whole lot of other issues. They sometimes look redundant, but you could figure out an exact minimum for them to properly encode the structure. But properly encoding one structure is not the same as properly encoding all structures. The more general you make things, the more permutations you have to distinguish in the meta-data, thus the noisier it will seem to get.

Given all that pedantry, you could figure out the minimum necessary size of the meta-data with respect to all of the contexts it will be used for. Then you can look at any format and see if it is excessive.
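One way to make that concrete: if a format must be able to distinguish between n possible structures, its meta-data needs at least ceil(log2(n)) bits, assuming for simplicity that the structures are equally likely and the encoding is fixed-length. A tiny sketch of that lower bound:

```rust
// The minimum meta-data needed to tell n possible structures apart.
fn min_metadata_bits(n: u64) -> u32 {
    if n <= 1 { 0 } else { 64 - (n - 1).leading_zeros() } // ceil(log2(n))
}

fn main() {
    // A format holding one of a handful of structures needs very little;
    // a fully general format has to pay for every permutation it allows.
    for n in [2u64, 6, 1_000_000] {
        println!("{} structures => at least {} bits of meta-data", n, min_metadata_bits(n));
    }
}
```

Anything a format spends well beyond that bound, for the contexts it actually serves, is the excess.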

Then the only thing left to do is balance out the subjectivity of how the markers are represented.

If you publish that, and you explicitly define the contexts, then you have a format whose expressibility is understood and is as nice as possible with respect to those contexts. What’s left is just socializing the subjective syntax choices.

In that sense, you are just left with primitive datums and containers. If you view data that way, it gets a whole lot simpler. You are collecting together all of these different primitives and containers, and you need to ensure that any structure you use to hold them matches closely to the reality of their existence. We call that a model, and the full set of permutations the model can hold is its expressiveness. If you want the best possible representation, then you need to tighten down that model as much as possible without constricting it so much that legitimate data can’t be represented.
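A sketch of tightening a model, using a hypothetical payment record: the loose version can represent endless junk, while the tight version only holds the permutations that match reality.

```rust
// Loose: any string is accepted, so the expressiveness includes
// endless values that are not legitimate data.
#[allow(dead_code)]
struct LoosePayment {
    currency: String, // "USD", "usd", "dollars", "banana"...
}

// Tight: the model's permutations match the reality being modeled.
#[allow(dead_code)]
#[derive(Debug)]
enum Currency {
    Usd,
    Eur,
    Gbp,
}

struct TightPayment {
    currency: Currency, // illegitimate values cannot even be represented
}

fn main() {
    let p = TightPayment { currency: Currency::Eur };
    println!("{:?}", p.currency);
}
```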

From time to time, after you have collected it, you may want to move the data around. But be careful: only one instance of that data should be modifiable, because merging divergent copies reliably is impossible. The effort to get a good model is wasted if the data is then just haphazardly spread everywhere.
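Rust happens to enforce that rule directly in its borrow checker, which makes for a convenient sketch: any number of read-only views of the data, but only one writer at a time.

```rust
fn main() {
    let mut data = vec![1, 2, 3];

    // Any number of read-only views are fine...
    let reader = &data;
    println!("read: {:?}", reader);

    // ...but only one place may modify the data at a time; the compiler
    // rejects any other live reference while this borrow exists.
    let writer = &mut data;
    writer.push(4);
    println!("modified: {:?}", data);
}
```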

Data is the foundation of every system. Modeling properly can be complex, but the data itself doesn’t have to be the problem. The code built on top is only as good as the data below it. Good code with bad data is useless. Bad code with good data is fixable.
