Monday, September 4, 2017

Data Modeling

The core of most software systems is the collection of data. This data is rarely independent; there are many inter-relationships and outward associations with the real world. Some of these interconnections form the underlying structure of the data.

If the system doesn’t collect the data or if the data collected is badly structured, it can be extraordinarily expensive to fix it later. Most often, if there are deficiencies in the data that are too expensive to fix that leads the software developers to overlay convoluted hacks in an attempt to hide the problems. Long-term application of this strategy eventually degrades the system to the point of uselessness.

As such, data forms the foundation for all software systems. Code without data is useless, and unlike data, code can be easily changed later. This means that when building big complex systems, more care should be spent on ensuring that the necessary data is collected and structured properly than any other aspect of the system. Getting the data wrong is a fatal mistake that often takes years to fully manifest itself.

A common way to model data is with an ER diagram. This decomposes the data into a set of ‘entities’ and provides a means of specifying ‘relationships’ between them. The relationships are oriented on frequency so they can be zero, one or many on either side. Thus one entity might be connected to many different related entities, forming a one-to-many relationship. All possible permutations are viable, although between multiple entities some collections of relationships are unlikely.

This is a reasonable way to model some of the structure, but it is hampered because its expressiveness is directly tied to relational databases. Their underlying model is to store the entities into tables which represent a two-dimensional means of encoding the data. This works for many common programming problems but is not able to easily capture complex internal relationships like hierarchies or graphs.

The older form of decomposing into data structures is far more expressive. You can break down extremely complex programs as a set of data structures like lists, hash tables, trees, graphs, etc. With those, and some associated atomic primitives as well as the larger algorithms and glue code, we have a means to specify any known form of computation. That decomposition works but does not seem to be generally well known or understood by the software industry. It is also a lot harder to explain and teach. It was softened way back to form the underlying philosophy behind object-oriented programming, but essentially got blended with other ideas about distributed computing, evolutionary design, and real world objects. The modern remnants of this consolidation are quite confusing and frequently misunderstood.

So we have at least two means of modeling data, but neither of them really fit well or are easy to apply in practice. We can, however, fuse them together and normalize them so that the result is both consistent and complete. The rest of this post will try to outline how this might be achieved.

Terminology and definitions form the basis for understanding and communication, so it is well worth consideration. From ER diagrams, the word ‘entity’ is general enough for our purposes and isn’t horrifically overloaded in the software industry yet. For the subparts of an entity, the common term is ‘attributes’. Often in other contexts, we use the term ‘properties’. Either seems suitable for describing the primitive data that forms the core of an entity, but it might be better to stick with ‘attributes’. It is worth noting that if some of this underlying associated data has its own inherent non-trivial structure, it can be modeled as a separate entity bound by a one-to-one relationship. Because of that, we can constrain any of these attributes to just being single instances of primitive data. No need for internal recursion.

From an earlier post on null handling (Containers, Collections and Null) we can set the rules that no instance of an entity can exist unless it has all of its mandatory attributes. If an attribute is optional for any possible instance of the entity, it is optional for all possible instances. These types of constraints ensure that the handling and logic are simple and consistent. They need to be rigorous.

Every entity corresponds to something unique in the real world. This is another rigorous constraint that is necessary to ensure the data is useful. If under some circumstance that isn’t true, then the entity corresponds to a ‘collection’ of such instances, not an instance itself. In this way, we can always bind the data to something unique, even if that is just a bunch of arbitrarily categorized collections. The caveat though, is that we should never just arbitrarily categorize because that creates problems (since we are adding in fake information). But instead, we need a deeper analysis of the underlying domain to resolve any ambiguities.

If we get a unique principle in place, then it is absolutely true that every entity has a unique key. In many circumstances this will be a composite key, but then any such instance refers to one and only one entity in the real world.

In developing systems we often create a system-generated key as a level of indirection to replace some of the original data keys. This is a good practice in that it allows some of the composite attributes to be modifiable, without having to create a new entity. In historic systems, this keeps the related data attached across changes. This is a necessary property for systems that deal with human centered input, mostly because there is a relatively constant rate of errors and long periods of time before they are noticed. That needs to be built into the data model.

So, with some tighter constraints, we can decompose all of the data into entities and we can show how they relate to each other, but we still need a means of showing how the entities relate internally to their own set. As an example, mostly the employees in a company form a tree or directed-acyclic graph with respect to their managers. We could model this with some messy arrangement that has multiple entity types allowing different people to both report upwards and have employees as well, but this would be quite awkward and pedantic. What is preferable is to just model the entity itself as a tree (ignoring the more complicated arrangements for the moment).
This could be as easy as specifying the type for the entity as a ‘tree’ and adding in an attribute for ‘manager’. We simply overlay the data structure decomposition on top of the original entity. Thus we get an extension, where the internal entity relationship might be a data structure like a tree and to satisfy that structure we add in the appropriate attributes.

That works cleanly, but we would like to know which set of data structures is necessary to express the full width of any possible data. A guess at that might start initially with this collection: item, list, tree, dag, graph, hypergraph. That may seem fairly expressive, but we are aware of at least some form of dimensionality with respect to entities. That is, the entity might be a list of lists, or even a hypergraph of trees of lists. Thus we have the initial collection on any given axis, with any N axises. It can be any such combination.

As weird as that sounds, we could then specify a matrix as an entity of list^2, but noting that the lists are strongly bounded by the M and N sizes of the matrix. In most underlying data collections, the entities are just not that complicated, but we really don’t want to limit the models to the 80% mark.

As for representing these models, we can easily use the original ER diagrams with the extra information for including the entity type. We would also like a means of serializing these and given that the relationships are not necessarily planar we need a bit of extra mechanics here as well. Before we get there, it is worth noting that the external relationships are bound as graphs. A reasonable extension would be to allow them to be hypergraphs as well, which would make them consistent with the internal relationships. That is, we could have a relationship such as one-many-many between three entities. That could further compact the model, but would only be usable if it was there to highlight a relationship, not obscure it.

We can easily serialize an entity (and we know it isn’t intrinsically recursive). So all we need is a means to serialize a hypergraph. That is well understood for graphs as being an adjacency matrix, but with the cells containing the relationship itself instead of just 1 or 0. Or we can just list out the vertices (entities) and the edges (relationships). Something like ‘entityA-entityB-entityC: one-many-many’ would be quite suitable. We could iterate any list of unnamed variables as brackets [], and any named variables (like entities) as braces {}. Another suggestion is to iterate keys first, separated internally by commas, but switch to a semicolon to separate them from the rest of the attributes or other keys (there may be multiple means of getting uniqueness). If there is some reason to distinguish entities from other hash tables, we could use parenthesis () for this distinction. Thus we might then differentiate between objects and dictionaries, for clarity.

With a bit of white space, this would all provide a means to have a serialized text based format for specifying a data model, which means that we could apply other tools such as source-code control and formatting to be able to manipulate these specifications. With the addition of positioning information, we could easily render this as a diagram when needed.

Given the importance of data for any software system, it seems that we should focus strongly on building up analytical tools for correctly laying the foundations for the system. It should be easy to model data and organizations should have large repositories of all of their current models. It is often the case that some systems use eclectic subsets of the actual real underlying data models, but what we need is both a way to catch this early and to gradually enhance our knowledge to properly model the data for all of the domains. A subset is fine, but when it needs to be extended, it should be done properly. It is reckless to leave this to chance, and we so often see the consequences of doing that. If the analysis stage for software development is properly focused on really modeling the data, the other stages will proceed smoothly. If not, the initial failures will percolate into the rest of the work.