Thursday, June 6, 2024

Data Modelling

The strength and utility of any software system come from the data it persists.

Persisting junk data just wastes resources and makes it harder to use any good data you have collected.

Storing partial information is also bad. If you should have collected more data, but didn’t, it can be difficult or impossible to go back and fix it. Any sort of attempt to fake it will just cause more problems.

All data is anchored in the real world, not just the digital one. That might not seem generally true, for example, with log files. But those log entries come from running on physical hardware somewhere and are tied to physical clocks. While they are data about software running in the digital realm, the act of running itself is physical and always requires hardware, thus the anchor.

All data has a form of uniqueness. For example, users in a system mostly match people in reality. When that isn’t true, it is a system account of some type, but those are unique too, and usually have one or more owners.

Temporal data is associated with a specific time. If the time bucket isn’t granular enough, the uniqueness can be improperly lost. That is an implementation bug, not an attribute of the data.
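
A minimal sketch of that temporal case, with hypothetical sensor names and schema: the timestamp is part of the key, and truncating it to too coarse a bucket silently collapses distinct readings into one.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical sketch: a sensor reading keyed by (sensor_id, timestamp).
@dataclass(frozen=True)
class ReadingKey:
    sensor_id: str
    timestamp: datetime        # the time anchor is part of the uniqueness

a = ReadingKey("sensor-7", datetime(2024, 6, 6, 10, 15, 1))
b = ReadingKey("sensor-7", datetime(2024, 6, 6, 10, 15, 2))
assert a != b                  # one second apart: two distinct readings

# If the implementation buckets time by the minute, the two keys collide and
# one reading is silently lost; that is a bug in the code, not in the data.
coarse_a = ReadingKey(a.sensor_id, a.timestamp.replace(second=0))
coarse_b = ReadingKey(b.sensor_id, b.timestamp.replace(second=0))
assert coarse_a == coarse_b
```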

Geographical data has a place associated with it, and often a time range as well.

Events are temporal; inventory is geographical. History is temporal. We’re seeing a pattern. Data is always anchored, therefore it has some type of uniqueness. If it didn’t, it would at least have an order or a membership somewhere.

Because data is unique, we can always key it. The key can be composite or generated, and there can be different types of unique keys for the same data, which is common when cross-referencing data between systems. For most systems, getting reasonable behavior requires keys so that we can build indices. There are exceptions, but the default is to figure out exactly what makes the data unique.
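
A small sketch, with hypothetical names: the same record can carry a generated surrogate key and a natural key at the same time, and the indices are just lookups built over whichever key an access path needs.

```python
from dataclasses import dataclass
import uuid

# Hypothetical sketch: one record, two kinds of unique keys.
@dataclass(frozen=True)
class User:
    user_id: str      # generated (surrogate) key
    email: str        # natural key, unique in the real world
    full_name: str

def new_user(email: str, full_name: str) -> User:
    return User(user_id=str(uuid.uuid4()), email=email, full_name=full_name)

users = [new_user("ada@example.com", "Ada Lovelace"),
         new_user("alan@example.com", "Alan Turing")]

# Indices are just lookups built over the keys we actually search by.
by_id = {u.user_id: u for u in users}
by_email = {u.email: u for u in users}
```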

Data always has a structure. Composite data travels together; it does not make sense on its own. Some data is in a list, or, if the order is unknown, a set.

Hierarchical data is structured as a tree. If subparts are not duplicated but cross-linked, it is a directed acyclic graph (dag). If the links are more arbitrary, it is a graph, or possibly a hypergraph.
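
A rough sketch, with hypothetical type names: the structural difference between a tree and a dag is whether a node may be shared by more than one parent; a general graph additionally allows cycles.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: hierarchical structures in increasing generality.
@dataclass
class TreeNode:          # every node has exactly one parent
    name: str
    children: list["TreeNode"] = field(default_factory=list)

@dataclass
class DagNode:           # sub-parts are shared, not duplicated; still no cycles
    name: str
    children: list["DagNode"] = field(default_factory=list)
    parents: list["DagNode"] = field(default_factory=list)

# A general graph drops the no-cycle guarantee; an adjacency map keyed by
# node name is usually the simplest way to hold it.
Graph = dict[str, set[str]]
```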

If data has a structure, you cannot ignore it. Flattening a graph down to a list, for example, loses valuable information about the data. You have to collect and store the data in the same structure in which it exists.

Every datum has a name. Sometimes it is not well known, sometimes it is very generalized. All data should be given a self-describing name. It should never be misnamed, but occasionally we can skip the name when it can be inferred.

Understanding the name of a datum is part of understanding the data. The two go hand in hand. If you cannot name the parts, you do not understand the whole.

If you don't understand the data, the likelihood that you will handle it incorrectly is extremely high. Data is rarely intuitive, so assumptions are usually wrong. You don’t understand a domain problem until you at least understand all of the data that it touches. You cannot solve a problem for people if you don’t understand the problem itself. The solution will not cover the problem correctly.

There are many forms of derived data. Some derivations are the same information just in a different form. Some are composites. Some are summaries of the data along different axes. An average of some numbers is a summary. You can get to the average from the numbers by redoing the calculation, but you cannot go the other way. There are infinitely many sets of numbers that would give the same average.
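
A tiny illustration of that one-way direction, nothing more:

```python
# Many different underlying datasets summarize to the same average.
def average(xs: list[float]) -> float:
    return sum(xs) / len(xs)

assert average([10, 20, 30]) == 20.0
assert average([0, 40, 20]) == 20.0
assert average([20, 20, 20]) == 20.0
# Recomputing the summary from the numbers is trivial; recovering the numbers
# from the summary is impossible, since the derivation loses information.
```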

Technically, derived data can be unique. You can build a composite key from all of the attributes of the underlying data that you are combining to get the value. Realistically, we don’t do that very often; it’s often cheaper and safer to just regenerate it as needed.

Most useful data is complex. It is a large combination of all sorts of intertwined other data structures. Presenting it in different ways helps people make better decisions. It has recursively diverse sub-structures, so it might be a list of trees of graphs of lists, for example. From that perspective, it is too complex for most people to grapple with, but we can adequately work through each of the subparts on its own and then recombine them effectively.
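
As a loose sketch of that kind of nesting (all type names here are hypothetical), a report might be a list of trees whose leaves point into graphs whose nodes hold plain lists of values:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: recursively diverse sub-structures.
@dataclass
class GraphNode:
    values: list[float]                                  # innermost: a plain list
    neighbours: list["GraphNode"] = field(default_factory=list)

@dataclass
class ReportNode:
    label: str
    children: list["ReportNode"] = field(default_factory=list)
    leaf_graph: GraphNode | None = None                  # leaves point into a graph

Report = list[ReportNode]                                # outermost: a list of trees
```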

Sometimes the full set of available data for something is just too large to capture. We ignore some dimensions or attributes and only capture the parts of the data relevant to a given context. That is common, but for any given piece of data within that context, we still don’t want it to be partial, which would make it junk. A classic example is capturing a graph of roads between cities without capturing it as a geographical database. We dropped the coordinate information, but we still captured enough to properly identify the cities, so later, if required, we can reconnect the two different perspectives. Thus you may not want or need to capture everything, but you still have to be careful about which aspects you don’t capture, which means you still need to understand them.
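
A sketch of that road example, with hypothetical city names: the adjacency structure deliberately drops the coordinates but keeps the city identities, which is exactly what allows it to be reconnected to a geographical dataset later.

```python
# Roads between cities captured as a graph, deliberately without coordinates.
roads: dict[str, set[str]] = {
    "Springfield": {"Shelbyville", "Capital City"},
    "Shelbyville": {"Springfield"},
    "Capital City": {"Springfield"},
}

def neighbours(city: str) -> set[str]:
    # The city name is the identity that would let this graph be joined
    # back to a geographical database if that ever became necessary.
    return roads.get(city, set())
```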

Data modeling, then, is two very different things. First, you are setting up structures to hold the data, but you are also putting in constraints to restrict what you hold. You have to understand what you need to capture as well as what you need to avoid accidentally capturing. What you cannot store is as important as what you can store. A classic example is an organization where people can report to different bosses at the same time. Shoving that into a tree will not work; it needs to be a dag. You would not want it in a general graph either, as that would allow for cycles. If you need to know who is reporting to whom, you need to capture it correctly. Getting it wrong misleads the users, which is very bad.
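
A minimal sketch of that constraint, with hypothetical names: storing the reporting lines as a dag means allowing multiple bosses but rejecting any edge that would introduce a cycle, which is precisely the restriction a plain graph would not enforce.

```python
# Hypothetical sketch: reporting relationships as a dag.
# reports_to maps a person to the set of bosses they report to.
reports_to: dict[str, set[str]] = {}

def would_create_cycle(person: str, new_boss: str) -> bool:
    # Follow the chain upward from the proposed boss; if we reach the person,
    # adding the edge would close a cycle.
    seen, stack = set(), [new_boss]
    while stack:
        current = stack.pop()
        if current == person:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(reports_to.get(current, set()))
    return False

def add_reporting_line(person: str, boss: str) -> None:
    if would_create_cycle(person, boss):
        raise ValueError(f"{person} -> {boss} would create a reporting cycle")
    reports_to.setdefault(person, set()).add(boss)   # multiple bosses allowed

add_reporting_line("carol", "alice")
add_reporting_line("carol", "bob")      # two bosses at the same time: fine
# add_reporting_line("alice", "carol")  # would raise: closes a cycle
```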

Limits of understanding and time constraints are often used as excuses for not spending the time to properly model the data, but most difficult bugs and project delays come directly from that failure. Data anchors all of the code, so if its representations are messed up, any code built on top is suspect and often a huge waste of time. Why avoid understanding the data only to rewrite the code over and over again? It’s obviously slower.

There is a lot more to say about modeling. It is time-consuming and pedantic, but it is the foundation of all software systems. There is a lot to understand, but skipping it is usually a disaster. Do it correctly and the code is fairly straightforward to write. Do it poorly and the hacks quickly explode into painful complexity that always spirals out of control.
