Sunday, March 29, 2015

Types of Data

The primary goal of any software system is to collect data.

Some of that data represents entities in the real world; it is raw information captured directly from the domain. Most systems store this raw data as simply as possible, although it usually gets re-keyed to align with the rest of the system, and sometimes the structure is rearranged. Usually it is preserved in its rawest format, so that it is easier to double-check its correctness and to update it later if there are corrections.

Raw data accounts for the bulk of the data in most systems, and it is usually at the core of the functionality that actually solves problems for the user. In most domains, it comes as a finite set of major entities whose structure is driven by the real-world collection techniques involved in getting it.

Raw data has breadth or depth, and occasionally both. By breadth, I mean that there are a large number of underlying parts to it, such that if the data were stored in a normalized schema, there would be a huge number of different, interconnected tables. Healthcare is an example of a domain that has considerable breadth, given that there are a lot of different types of diseases, diagnostics and treatments.

By depth, I mean that the same entities are repeated over and over again, en masse. Intraday stock data is an example of depth. On an entity-by-entity basis, breadth and depth usually don't exist together, although some domains are really combinations of the two. Bonds, for example, can have a large number of different flavors, and they also have lots of different intraday valuations. But the master data is always a different entity than the daily quotes.
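To make the distinction concrete, here is a small Java sketch along the lines of the bond example. All of the names and fields are invented for illustration; real master data would have far more substructure:

```java
import java.time.Instant;
import java.time.LocalDate;
import java.util.List;

final class BreadthVsDepth {
    // Breadth: one master entity with many interconnected descriptive parts.
    record Issuer(String name, String country) {}
    record CallDate(LocalDate date, double price) {}
    record BondMaster(String id, Issuer issuer, double coupon,
                      LocalDate maturity, List<CallDate> callSchedule) {}

    // Depth: the same small structure repeated en masse, one row per quote.
    record IntradayQuote(String bondId, Instant time, double bid, double ask) {}

    public static void main(String[] args) {
        var master = new BondMaster("B-1", new Issuer("Acme", "CA"), 4.25,
                LocalDate.of(2030, 6, 1),
                List.of(new CallDate(LocalDate.of(2025, 6, 1), 101.0)));
        var quote = new IntradayQuote("B-1", Instant.now(), 99.5, 99.7);
        System.out.println(master);
        System.out.println(quote);
    }
}
```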

Depth tends to be static with regard to structure. Some domains have entities of static breadth, but more often than not the breadth is dynamic. It changes on a regular basis, which is often the reason it has so much substructure in the first place. Dynamic breadth can be contained via abstraction, but most people find the abstract relationships tricky to understand and visualize, so it isn't common practice. Instead there is continual scope creep.

Since computers can conveniently augment the original data, systems usually have a fair amount of derived data. Sometimes this data is calculated by extending the relationships; sometimes it is just basic summary information. Derived data is more often calculated on the fly, usually to avoid staleness and inconsistency. Some derived data is composed entirely of calculations over raw data; some is built on other derived values. If the dependencies are visualized as a directed graph, all of the terminal nodes are always raw data. That is, what goes into the calculations must come from somewhere.
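A minimal Java sketch of that dependency idea, using hypothetical values: each derived value is a pure function whose inputs are either raw data or another derived value, so following the dependencies back always lands on raw data.

```java
import java.util.List;

final class DerivedData {
    // Raw data, as captured from the domain (values are hypothetical).
    static final List<Double> RAW_PRICES = List.of(10.0, 12.5, 11.0, 13.5);

    // First-level derived value: computed entirely from raw data.
    static double total(List<Double> prices) {
        return prices.stream().mapToDouble(Double::doubleValue).sum();
    }

    // Second-level derived value: built on another derived value.
    static double average(List<Double> prices) {
        return total(prices) / prices.size();
    }

    public static void main(String[] args) {
        System.out.println("total=" + total(RAW_PRICES)
                + ", average=" + average(RAW_PRICES));
    }
}
```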

The computations used are often called 'business logic', but some of the more esoteric ones really encompass the irrationality of historic conventions. Derived data is most often fast to calculate, but there are plenty of domains where it can take hours, or even days. Those big calculations are usually stored in the persistence technology, and as such are sometimes confused with raw data.

Raw data is often self-explanatory in terms of its structure as stored persistently. Derived data, however, usually requires deeper explanations, and in most systems it is the predominant source of programming bugs. Not only do the programmers need to understand how it is constructed, but the users do as well. As a consequence, most derived-data calculations are either reasonably straightforward or they are created from a component built and maintained by external domain experts who are authoritative references.

In well-organized designs, the derived data calculations are all collected together in some consistent subsystem. This makes it easier to find and correct problems. If the calculations are scattered throughout the code base, then inconsistencies are common and testing is very expensive.

The underlying calculations can and should be implemented in a stateless manner. That is, all of the required data should be explicitly passed into the calculations. In some domains, since the underlying data triggers branches, it is not uncommon to see the code inverted. As the calculation progresses, the branches cause auxiliary data to be gathered: if the data is type A, then table A is loaded, and if it is B, then table B is loaded. This type of logic is easy to build up, but it complicates the control flow and error handling in the code, and it also diminishes reusability.
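A small Java sketch of that inversion, with in-memory maps standing in for the database tables (all names and values invented):

```java
import java.util.Map;

final class InvertedCalculation {
    enum Type { A, B }
    record Input(Type type, String key, double amount) {}

    static final Map<String, Double> TABLE_A = Map.of("k1", 1.5);  // rates
    static final Map<String, Double> TABLE_B = Map.of("k1", 5.0);  // surcharges

    static double calculate(Input in) {
        // The type branch decides which auxiliary data gets fetched
        // mid-calculation, so control flow, error handling and data
        // access are all entangled.
        if (in.type() == Type.A) {
            double rate = TABLE_A.get(in.key());       // 'load' from table A
            return in.amount() * rate;
        } else {
            double surcharge = TABLE_B.get(in.key());  // 'load' from table B
            return in.amount() + surcharge;
        }
    }

    public static void main(String[] args) {
        System.out.println(calculate(new Input(Type.A, "k1", 100.0))); // 150.0
    }
}
```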

It takes a bit of thinking, but virtually all derived calculations can be refactored to have the necessary data passed in as the initial inputs. In that way, all valid rows from both table A and table B are input, no matter what the underlying type. The performance of coding it this way is also better: in a big system, the branching won't gradually degenerate into too many sub-requests. Given this, derived data should always be implemented as a set of straightforward functions that are stateless and entirely independent from the rest of the system. There should be no side-effects, no global access and no need for external interactions; just a black-box engine that takes data and returns derived results. Getting derived data fully encapsulated makes it really easy to extend and to deploy in a wide variety of circumstances. In the rare circumstance where this is not possible for some tricky special case, the rest of the derived data in the system should still be implemented correctly (one small special case does not excuse ignoring best practices).
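The same hypothetical calculation, refactored so that all of the data is passed in up front:

```java
import java.util.Map;

final class StatelessCalculation {
    enum Type { A, B }
    record Input(Type type, String key, double amount) {}

    // All required data is an explicit input; nothing is fetched inside.
    // A pure black box: no side-effects, no global access, no external calls.
    static double calculate(Input in, Map<String, Double> tableA,
                            Map<String, Double> tableB) {
        return (in.type() == Type.A)
                ? in.amount() * tableA.get(in.key())   // rate from table A
                : in.amount() + tableB.get(in.key());  // surcharge from table B
    }

    public static void main(String[] args) {
        // The caller gathers the rows up front; the engine just computes.
        Map<String, Double> tableA = Map.of("k1", 1.5);
        Map<String, Double> tableB = Map.of("k1", 5.0);
        System.out.println(calculate(new Input(Type.B, "k1", 100.0),
                tableA, tableB)); // prints 105.0
    }
}
```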

The third type of data in a system is presentation data. Given the above definitions of raw and derived, initially it might not seem like there is a lot of data in this category, but it is surprisingly large. Presentation is primarily about how we dress up data for the screen, but it also implicitly includes the navigational information about how the user got to the screen in the first place. Thus we have decorative data, like font choices and colors, but we also have things like session ids, the current primary data key and any of the options that the user has set or customized. In most professional systems that actually includes a lot of user, group and system preferences, as well as authentication and authorization. Basically it's anything that isn't raw, or derived from the raw data, which in modern systems is considerable.

Presentation data has often confused programmers, particularly with respect to patterns like model-view-controller (MVC). There is a tendency to think of the model as only the raw data, then implement the derived data as methods, and leave out the presentation data altogether. Done this way, the construction generally degenerates into a mess. The model should be the model for all of the data in the whole system, and act as a 1:1 mapping between main entities and objects. Thus the model would contain sets of objects for all three types of data, which would include, for example, an object that encapsulates the current navigational information for a given user session. It would also include any fiddly little switches on the screen that the user has selected.

In most cases, the objects in the model would be nouns, and their interactions would be equivalent to the way that the user describes what is happening at an interface level. The model, in this sense, is essentially a set of object wrappers for any and all structural data within the system, at any time. In most cases, one can stop at the main entity level or one below; not all substructural data needs to be explicitly encapsulated with objects, but the depth should be consistent. Creating the full model for the system means that it is well-defined where to put code like validation and decoration. It also means that there is another 1:1 mapping, between the widgets in the UI and the model. This again keeps the code organized and clearly defines only one place for any new logic.
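As a rough Java sketch of what such a model might contain, with every entity and field invented for illustration:

```java
final class SystemModel {
    // Raw data: captured directly from the domain.
    record Patient(String id, String name) {}

    // Derived data: computed from the raw data.
    record PatientSummary(String patientId, int visitCount) {}

    // Presentation data: navigation and user choices, modelled explicitly.
    record SessionNavigation(String sessionId, String currentScreen,
                             String currentPatientId) {}
    record ScreenOptions(boolean showArchived, String sortColumn) {}

    public static void main(String[] args) {
        var nav = new SessionNavigation("s-42", "patient-detail", "p-7");
        System.out.println(nav); // where the user is, captured as data
    }
}
```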

An organized system is one where there is only one valid location for any of the code, and a well constructed system is where everything is exactly where it should be.

There are really only three types of data in a system: raw, derived and presentation. These percolate throughout the code. Most code in large systems is about moving the three types of data around between the interface and the persistence technology. If the three types are well understood, then very straightforward organizational schemes can be applied to keep them encapsulated and consistent. These schemes considerably reduce the amount of code needed, saving time and reducing testing and bugs. Most programs don't have to be as complicated as they have become; the unnecessary complexity not only makes them hard to extend, it also decreases their performance. If we step back from the code, we are able to see these similarities and leverage them with simple abstractions and consistent rules. Minimal code is directly tied to strong organization; that is, spaghetti code is always larger than necessary and usually disorganized on multiple different levels. If the data management is clean, the rest of the system cleanly falls into place.
