Friday, October 31, 2025

The Structure of Data

A single point of data -- one ‘variable’ -- isn’t really valuable.

If you have the same variable varying over time, then its history might give you an indication of its future behavior. We like to call these ‘time series’.

You can clump together a bunch of similar variables into a ‘composite variable’. You can mix and match the types; it makes a nexus point within a bunch of arbitrary dimensions.

If you have a group of different types of variables, such as some identifying traits about a specific person, then you can zero in on uniquely identifying one instance of that group and track it over time. You have a ‘key’ to follow it, you know where it has been. You can have multiple different types of keys for the same thing, so long as they are not ambiguous.
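As a rough sketch of the idea, here is a composite variable reachable through two different keys. The `Person` fields, the `by_id` and `by_email` indexes, and the `register` helper are all made up for illustration; the only point is that each key must resolve to exactly one instance.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Person:
    person_id: int   # hypothetical primary key
    email: str       # hypothetical alternate key
    name: str

# Two indexes over the same instances, one per key type.
by_id = {}
by_email = {}

def register(person):
    """Index a person under both keys, refusing any ambiguous key."""
    if person.person_id in by_id or person.email in by_email:
        raise ValueError("key already in use")
    by_id[person.person_id] = person
    by_email[person.email] = person

register(Person(1, "ada@example.com", "Ada"))
```

Either key now leads to the same object, and a second person trying to reuse one of them is rejected rather than silently overwriting the first.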

You might want to build up a ‘set’ of different things that you are following. There is no real way to order them, but you’d like them to stay together, always. The more data you can bring together, the more value you have collected.

If you can differentiate for at least one given dimension, you can keep them all in an ordered ‘list’. Then you can pick out the first or last ones over the others.
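A small sketch of the difference, assuming a hypothetical `Reading` type: the set only keeps things together, while sorting on one dimension (here `value`) is what makes "first" and "last" meaningful.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reading:
    sensor: str    # hypothetical identifier
    value: float   # the dimension we can differentiate on

# A set: the readings stay together, but there is no order to rely on.
readings = {Reading("a", 3.5), Reading("b", 1.2), Reading("c", 9.0)}

# A list: ordering by one dimension lets us pick out the ends.
ordered = sorted(readings, key=lambda r: r.value)
first, last = ordered[0], ordered[-1]
```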

Sometimes things pile up, with one thing above a few others. A few layers of that and we get a ‘tree’. It tends to be how we arrange ourselves socially, but it also works for breaking down categories into subcategories or combining them back up again.
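One way to picture the category case is a minimal sketch like the one below. The `Category` class and its `count` field are assumptions for the example; `total` combines the subcategories back up into their parent.

```python
from dataclasses import dataclass, field

@dataclass
class Category:
    name: str
    count: int = 0                                 # items filed directly here
    children: list = field(default_factory=list)   # subcategories

    def total(self):
        """Roll the counts of all subcategories back up into this one."""
        return self.count + sum(child.total() for child in self.children)

fruit = Category("fruit", 0, [Category("apples", 4), Category("pears", 2)])
food = Category("food", 1, [fruit, Category("bread", 3)])
```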

Once in a while, it is a messy tree. The underlying subcategories don’t fit uniquely in one place. That is a ‘directed acyclic graph’ (DAG), which also ties back to some optimizing forms of memoization.
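To make the memoization link a little more concrete, here is a sketch over a tiny, made-up structure where the `shared` node sits under two parents. The `children` mapping, the `depth` function, and the `calls` log are illustrative only; `functools.lru_cache` does the remembering.

```python
from functools import lru_cache

# Hypothetical messy tree: "shared" belongs under both "left" and "right".
children = {
    "root": ("left", "right"),
    "left": ("shared",),
    "right": ("shared",),
    "shared": (),
}

calls = []  # records which nodes were actually computed

@lru_cache(maxsize=None)
def depth(node):
    """Number of layers from this node down to its deepest descendant."""
    calls.append(node)
    return 1 + max((depth(child) for child in children[node]), default=0)

root_depth = depth("root")
```

Because the result for `shared` is cached, it is computed once even though two different paths lead to it. That only works because there are no cycles to chase.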

When there is no hierarchical order to the whole thing it is just a ‘graph’. It’s a great way to collect things, but the flexibility means it can be dangerously expensive sometimes.

You can impose some flow, making the binary edges into directional ones. It’s a form of embedding traits into the structure itself.
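A sketch of what that flexibility costs, using a small invented graph with directional edges and a cycle in it. The `edges` mapping and `reachable` helper are assumptions for the example; the `seen` set is the extra bookkeeping a tree never needed.

```python
from collections import deque

# Hypothetical directed graph: a -> b -> c -> a forms a cycle, c also leads to d.
edges = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}

def reachable(start):
    """Breadth-first walk that follows edges only in their direction."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges[node]:
            if nxt not in seen:   # without this check the cycle never ends
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

From `a` everything is reachable, but from `d` nothing else is, since the direction is part of the structure.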

But the limits of a single-dimensional edge may be too imposing, so you could allow edges that connect more than two entries, which is called a ‘hypergraph’. These are rare, but very powerful.
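One simple way to represent this, as a sketch only: treat each edge as a named set of members rather than a pair. The project and people names below are invented, and `edges_touching` is a hypothetical helper.

```python
# Each hyperedge groups any number of entries, not just two.
hyperedges = {
    "project_x": frozenset({"ana", "ben", "cy"}),
    "project_y": frozenset({"ben", "dee"}),
}

def edges_touching(node):
    """Names of every hyperedge that includes the given entry."""
    return sorted(name for name, members in hyperedges.items() if node in members)
```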

We sometimes use the term ‘entity’ to refer to our main composite variables. They relate to each other within the confines of these other structures, although we look at them slightly differently in terms of, say, 1-to-N relationships, where both sides are effectively wrapped in sets or lists. It forms some expressive composite structures.
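A minimal sketch of a 1-to-N relationship between two entities, assuming hypothetical `Customer` and `Order` types. The N side is wrapped in a list here; a set would do if the order did not matter.

```python
from dataclasses import dataclass, field

@dataclass
class Order:
    order_id: int
    total: float

@dataclass
class Customer:
    customer_id: int
    orders: list = field(default_factory=list)  # the N side, in arrival order

customer = Customer(1)
customer.orders.append(Order(10, 5.0))
customer.orders.append(Order(11, 7.5))
spent = sum(order.total for order in customer.orders)
```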

You can loosely or tightly structure data as you collect it. Loose works if you are unsure about what you are collecting; it is flexible, but costly. Tight tends to be considerably more defensive, with fewer bugs and better error handling.
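The contrast might look something like this sketch. The field names and the validation rule are assumptions made up for the example: the loose dictionary accepts anything, while the tight `Person` refuses bad data at the moment it is collected.

```python
from dataclasses import dataclass

# Loose: anything goes in, and the problem only surfaces much later.
loose = {"name": "Ada", "age": "thirty-six"}

# Tight: the structure checks itself as soon as it is built.
@dataclass(frozen=True)
class Person:
    name: str
    age: int

    def __post_init__(self):
        if not isinstance(self.age, int) or self.age < 0:
            raise ValueError(f"age must be a non-negative integer, got {self.age!r}")

ada = Person("Ada", 36)
```

Building `Person("Ada", "thirty-six")` raises a `ValueError` immediately, which is the defensive behavior: the error shows up where the bad data arrived, not somewhere deep in the code that uses it.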

It’s important not to collect garbage; it has no inherent value, and it causes painful ‘noise’ that makes it harder to understand the real data.

The first thing to do when writing any code is to figure out all of the entities needed and to make sure their structures are well understood. Know your data, or suffer greatly from the code spiraling out of control. Structure tends to get frozen far too quickly; just trying to duct tape over mistakes leads to massive friction and considerable wasted effort. If you misunderstood the structure, admit it and fix the structure first, then the code on top.
