Sunday, September 13, 2015

Data Modelling

Data is the foundation for all software. The strength of its internal relationships makes it possible for computers to assist us with real world problems. Without carefully modelled data, code just spins uselessly.

A collected datum by itself has little value. It needs both recurring frequency and interconnections with other related points. Collected together, however, a group of points identifies a deeper meaning; it pinpoints something specific. If we collect enough data, and we model it appropriately, we can leverage this information to help us.

There hasn't been any real growth in the area of modelling for at least 20 years; the strongest tools we currently have for modelling are ER diagrams. They are backed by a deeper set of entity modelling methodologies, but few brave souls ever descend to those depths.

In an ER diagram each datum is an attribute, and attributes are combined to form an entity. We can then specify relationships between these various entities. For most data with relatively static relationships this works quite well. Analysis and modelling are well understood and easy to apply, although they often seem to have been ignored.

Relationships between entities happen on both ends as 0, 1 or many times. This can capture any entity with a set of children or even a many-to-many entity that multiplexes between two others. The allowance of 0 is really an offhanded means of supporting optional relationships. It's expressive, but somewhat overloaded.
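As a minimal sketch of those cardinalities, here is a many-to-many relationship resolved through a third entity, using Python's built-in sqlite3 module. The student, course and enrolment tables and their sample rows are hypothetical, invented purely for illustration.

```python
import sqlite3


def build_enrolment_db() -> sqlite3.Connection:
    """Create a hypothetical schema: student <-> course via enrolment."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE course  (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
        -- The multiplexing entity: one row per (student, course) pair.
        CREATE TABLE enrolment (
            student_id INTEGER NOT NULL REFERENCES student(id),
            course_id  INTEGER NOT NULL REFERENCES course(id),
            PRIMARY KEY (student_id, course_id)
        );
        INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
        INSERT INTO course  VALUES (1, 'Algebra'), (2, 'Biology');
        INSERT INTO enrolment VALUES (1, 1), (1, 2), (2, 1);
    """)
    return conn


def course_counts(conn: sqlite3.Connection) -> dict:
    """Return {student name: number of courses}, including students with none."""
    rows = conn.execute("""
        SELECT s.name, COUNT(e.course_id)
        FROM student s
        LEFT JOIN enrolment e ON e.student_id = s.id
        GROUP BY s.id
        ORDER BY s.id
    """).fetchall()
    return dict(rows)
```

The LEFT JOIN is what surfaces the "0" end of the relationship: a student with no enrolments still appears, with a count of zero.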

ER diagrams map directly back to relational databases (entity -> table) and have simple (although somewhat abstruse) rules for how to appropriately normalize the structure in order to maximize the functionality. Normalization, unfortunately, is also often ignored, resulting in severe operational problems and damaging the ability to extend the usefulness of the code later. It's sad in that it isn't difficult knowledge to acquire and is essentially mandatory if a relational database is needed for persistence. If the foundations are broken, everything built on top of them inherits that dysfunction.
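To make the operational cost of skipping normalization concrete, here is a small sketch of an update anomaly, again with sqlite3. The order_flat, customer and customer_order tables are hypothetical; the flat table repeats customer details on every order, while the normalized pair stores them once.

```python
import sqlite3


def build_orders_db() -> sqlite3.Connection:
    """Hold the same two orders in a flat table and in a normalized pair."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Denormalized: customer details repeated on every order row.
        CREATE TABLE order_flat (
            order_id INTEGER PRIMARY KEY,
            customer_name TEXT NOT NULL,
            customer_city TEXT NOT NULL
        );
        INSERT INTO order_flat VALUES (1, 'Ann', 'Oslo'), (2, 'Ann', 'Oslo');

        -- Normalized: customer details stored exactly once.
        CREATE TABLE customer (
            id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT NOT NULL
        );
        CREATE TABLE customer_order (
            order_id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customer(id)
        );
        INSERT INTO customer VALUES (1, 'Ann', 'Oslo');
        INSERT INTO customer_order VALUES (1, 1), (2, 1);
    """)
    return conn


def move_customer(conn: sqlite3.Connection) -> None:
    """Ann moves to Bergen; the flat table is only partially updated."""
    # A careless update that touches one of the duplicated rows.
    conn.execute("UPDATE order_flat SET customer_city = 'Bergen' WHERE order_id = 1")
    # The normalized schema has a single place to change.
    conn.execute("UPDATE customer SET city = 'Bergen' WHERE id = 1")


def cities_flat(conn: sqlite3.Connection) -> list:
    """Distinct cities recorded for Ann in the flat table."""
    return [r[0] for r in conn.execute(
        "SELECT DISTINCT customer_city FROM order_flat "
        "WHERE customer_name = 'Ann' ORDER BY customer_city")]


def cities_normalized(conn: sqlite3.Connection) -> list:
    """Distinct cities recorded for Ann across her normalized orders."""
    return [r[0] for r in conn.execute(
        "SELECT DISTINCT c.city FROM customer_order o "
        "JOIN customer c ON c.id = o.customer_id "
        "WHERE c.name = 'Ann' ORDER BY c.city")]
```

After move_customer runs, the flat table reports two different cities for the same customer, while the normalized schema cannot disagree with itself.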

ER diagrams are not particularly good for complex data-structures like timeseries, trees and graphs. This is unfortunate, since as we tackle more sophisticated problems these structures are far more frequent. Still, it doesn't take much fiddling to adapt the basic ideas to handle them. A data-structure is just a recursive container for variables and other data-structures. An entity is also a container, but not recursive, so the extension is obvious although it is difficult to know whether to express the recursion internally in the entity or as an external relationship. The latter is clearer in breadthwise data, while the former is clearer when coping with depth.
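One possible way to adapt an entity to hold a tree is a self-referencing key, walked with a recursive query. This is only a sketch under that assumption; the node table, its parent_id column and the sample labels are hypothetical, and other encodings (a separate edge table, for instance) may suit other shapes of data better.

```python
import sqlite3


def build_tree_db() -> sqlite3.Connection:
    """Create a hypothetical node table where each row points at its parent."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE node (
            id INTEGER PRIMARY KEY,
            parent_id INTEGER REFERENCES node(id),  -- NULL marks the root
            label TEXT NOT NULL
        )
    """)
    conn.executemany(
        "INSERT INTO node VALUES (?, ?, ?)",
        [(1, None, "root"), (2, 1, "a"), (3, 1, "b"), (4, 2, "a1")],
    )
    return conn


def subtree(conn: sqlite3.Connection, root_id: int) -> list:
    """Return (label, depth) for root_id and everything beneath it."""
    return conn.execute("""
        WITH RECURSIVE walk(id, label, depth) AS (
            SELECT id, label, 0 FROM node WHERE id = ?
            UNION ALL
            SELECT n.id, n.label, w.depth + 1
            FROM node n JOIN walk w ON n.parent_id = w.id
        )
        SELECT label, depth FROM walk ORDER BY depth, id
    """, (root_id,)).fetchall()
```

The recursion lives in the query rather than in the table definition, which keeps the entity itself flat while still allowing arbitrary depth.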

Decomposing complex data models into data-structures and their relationships makes it easy to map the data back to a model in code, but care must be taken. Turing-complete programming languages most often appear as considerably more expressive than the core relational model. I'm not sure if this is theoretically true, but it has shown itself to be practically the case; most often made obvious by programmers writing their internal code first, then getting into severe problems when trying to connect it back to the schema. For this reason, for any persistent data, it is far superior to sort out the representation in the persistent technology first, then map it in a 1:1 fashion back to the code. That remains true no matter what programming paradigm is being used. It also provides justification for viewing the persistence as a 'foundation' on which all other stuff is built.
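A rough sketch of that ordering: the schema is settled first, and the in-code type simply mirrors it, field for column. The customer table and the Customer class are hypothetical names chosen for the example.

```python
import sqlite3
from dataclasses import dataclass, fields


def build_customer_db() -> sqlite3.Connection:
    """Step one: the persistent representation, decided before any code model."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE customer (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            email TEXT NOT NULL
        )
    """)
    conn.executemany(
        "INSERT INTO customer VALUES (?, ?, ?)",
        [(1, "Ann", "ann@example.com"), (2, "Bob", "bob@example.com")],
    )
    return conn


@dataclass(frozen=True)
class Customer:
    """Step two: a 1:1 mirror of the customer table, one field per column."""
    id: int
    name: str
    email: str


def load_customers(conn: sqlite3.Connection) -> list:
    """Read every row into a Customer, using the field names as column names."""
    columns = ", ".join(f.name for f in fields(Customer))
    cursor = conn.execute(f"SELECT {columns} FROM customer ORDER BY id")
    return [Customer(*row) for row in cursor]
```

Because the class carries nothing that the table does not, there is no impedance to resolve when the two are connected.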

As such, while user interfaces should be constructed top-down (to ensure empathy with the users), data should be built from the bottom up. A slight contradiction, but it is resolved by always dealing with the data first; once that has been completed, it can be connected to the user interface. It's an inverted T pattern. Lay the foundations, extend them to the interface, then rinse and repeat to span it out to the full functionality. Built in that order, it opens up the ability to maximize reuse, which saves massive resources.

Doing the actual modelling work is fairly easy. Find out all of the currently existing data, figure out what other data will also need to be captured, then start bringing all of the attributes together into entities. A whiteboard really helps, in that in order to work through all of the special and corner cases it sometimes takes several rounds of re-arranging the structure until it fits appropriately. Still, it is way faster and a lot cheaper to work through the issues in design first, before they become trapped in a large code base. Modelling usually requires a bunch of people to work together to avoid tunnel vision.

For a very complex schema with a huge number of entities, the same basic structural patterns will repeat over and over again. Much like code reuse, we would prefer to generalize these common structural relationships and only implement them once, saving both time and resources. Generalization in this fashion is a form of abstraction that can be seen as collapsing as many special cases as possible into a minimal number of unique structures. Obviously, to keep it readable, the terminology must lift upwards and appropriately apply to the breadth of all included special case terminology. A generalization that lies is inappropriate. One that abstracts the structure, allowing enough flexibility to hold all special cases, is a huge asset. Generalizing, however, isn't independent of performance and resource issues. For instance, you can store all of the data as an entity-attribute-value (EAV) schema, in long skinny tables, but care needs to be taken because the number of rows now has a strong multiplier (one row per attribute per entity, rather than one row per entity), which can kill performance. A table with millions of rows isn't too bad on modern computers, but handling billions of rows isn't so easy. This type of generalization should always be prototyped first, to avoid burning through lots of resources on predictable problems.
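The row multiplier in an EAV layout is easy to see in a small prototype. The sketch below assumes a single hypothetical eav table with values stored as text; real EAV designs often add typing and attribute metadata, which are left out here.

```python
import sqlite3
from collections import defaultdict


def build_eav_db(records: dict) -> sqlite3.Connection:
    """Store {entity_id: {attribute: value}} as one row per attribute."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE eav (
            entity_id INTEGER NOT NULL,
            attribute TEXT NOT NULL,
            value TEXT,
            PRIMARY KEY (entity_id, attribute)
        )
    """)
    conn.executemany(
        "INSERT INTO eav VALUES (?, ?, ?)",
        [(eid, attr, str(val))
         for eid, rec in records.items()
         for attr, val in rec.items()],
    )
    return conn


def row_count(conn: sqlite3.Connection) -> int:
    """Number of physical rows needed to hold the data."""
    return conn.execute("SELECT COUNT(*) FROM eav").fetchone()[0]


def load_entities(conn: sqlite3.Connection) -> dict:
    """Pivot the skinny rows back into one dict per entity (values are text)."""
    out = defaultdict(dict)
    query = "SELECT entity_id, attribute, value FROM eav ORDER BY entity_id, attribute"
    for eid, attr, val in conn.execute(query):
        out[eid][attr] = val
    return dict(out)


# Two entities with three attributes each become six rows, not two.
SAMPLE = {
    1: {"name": "Ann", "city": "Oslo", "age": 41},
    2: {"name": "Bob", "city": "Bergen", "age": 37},
}
```

Every read also has to pivot the rows back into entities, and the typing is lost along the way (age comes back as text), so both the storage and the query side pay for the flexibility.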

Our current understanding of data modelling is predicated on the relationships being static. We essentially model the rather informal real world as a very rigid formal system. In some constrained contexts, like an accounting or inventory system, this works fine, but as we start to try to build smarter, more helpful software this rigidity is a huge problem. It would seem that the best option then is to include no structure, just let the data land where it may. That's not actually practical; a fully free-floating clump of data, like a massive text field, is essentially worthless because we cannot deterministically extract knowledge from it (humans can, but not computers). If we decompose everything, type it in and then build up dynamic graphs to capture the relationships, we now have something useful, but we will run afoul of computational complexity. That is, the resource usage can grow exponentially as we add just a little bit more data, which will rapidly make the program so slow that it becomes useless again. The time will quickly jump from a few seconds for a calculation to days, months and eventually years.
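As a rough illustration of that blow-up, consider the worst case where every point may relate to every other, and a calculation has to consider each distinct route between two points. The function below is a hypothetical brute-force counter, not a technique anyone would use in production; it only shows how quickly the work grows.

```python
def count_simple_paths(n: int) -> int:
    """Count routes from node 0 to node n-1 in a fully connected graph of n
    nodes, never visiting the same node twice."""
    target = n - 1

    def walk(current: int, visited: frozenset) -> int:
        if current == target:
            return 1
        total = 0
        for nxt in range(n):
            if nxt not in visited:
                total += walk(nxt, visited | {nxt})
        return total

    return walk(0, frozenset({0}))


def growth(max_nodes: int) -> list:
    """Path counts for graphs of 2 .. max_nodes nodes."""
    return [count_simple_paths(n) for n in range(2, max_nodes + 1)]
```

Going from 2 to 6 nodes, the counts run 1, 2, 5, 16, 65; each added node multiplies the work rather than adding to it, and real data rarely stops at six points.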

The trade-off between static and dynamic relationships is extraordinarily difficult and has not been explored nearly enough. Given this, the strongest approach is to model rather statically at first, then gradually generalize and make relationships dynamic. A careful step-by-step progression, checking for real performance issues at each iteration, can lead to finding the most appropriate level of dynamic behaviour for any given model. It certainly isn't fast work, and although experience can speed it up, in a big data model the inherent context is well beyond our ability to visualize the behaviour. That means that each step must be fully tested before the next one is taken. Although it is costly to build, dynamic data better matches the real world, thus making the software super-flexible. Done well, this incorporates the type of intelligence and sophistication that the users so desperately need to get their problems really solved, without tossing in negative side-effects.

Over the years, whenever I've encountered really bad software, the all too common thread has been that the data modelling was ignored. There was a plethora of excuses, plenty of hollow justifications, but none of that really mitigated the severe damage that erratic data caused in the code base, the development project and in some cases the business itself. I've generally taken bad modelling to be a symptom of other more severe problems, but once it gets entrenched it anchors the development far away from the possibility of success. That is, there is never a turnaround in the fortunes of a failing development project, not a real one, until the foundations of the project have been addressed and fixed. Fix the modelling, fix the schema, then get in all of the refactoring required to fix the code, and at least from a development perspective the project now has a fighting chance of survival. Continue to avoid acquiring any knowledge about modelling, and to ignore the consequences of that, and the project will just get worse. It doesn't really matter what the larger environmental or political problems are if the foundations for a software development project are fubar'ed. If there is no upper-level will to correct this, go find something else to do with your life; it will be infinitely more satisfying.