Sunday, March 29, 2015

Types of Data

The primary goal of any software system is to collect data.

Some of that data represents entities in the real world: it is raw information captured directly from the domain. Most systems store this raw data as simply as possible, although it usually gets re-keyed to align with the rest of the system, and sometimes the structure is rearranged. Usually it is preserved in its rawest format, so that it is easier to double-check its correctness and update it later if there are corrections.

Raw data accounts for the bulk of most data in the system, and it is usually at the core of the functionality that actually solves problems for the user. In most domains, it comes as a finite set of major entities whose structure is driven by the real-world collection techniques used to gather it.

Raw data has breadth or depth, and occasionally both. By breadth, I mean that there are a large number of underlying parts to it, such that if the data were stored in a normalized schema, there would be a huge number of different, interconnected tables. Healthcare is an example of a domain that has considerable breadth, given that there are a lot of different types of diseases, diagnostics and treatments.

By depth, I mean that the same entities are repeated over and over again, en masse. Intraday stock data is an example of depth. On an entity-by-entity basis, breadth and depth usually don't exist together, although some domains are really combinations of the two. Bonds, for example, come in a large number of different flavors, and they also have lots of different intraday valuations. But the master data is always a different entity than the daily quotes.
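
To make the distinction concrete, here is a minimal Python sketch; the entities and fields (a bond master record versus its intraday quotes) are illustrative assumptions rather than a real schema.

    from dataclasses import dataclass
    from datetime import datetime

    # Breadth: one entity with many distinct, interconnected parts. In a
    # normalized schema these would spread across many related tables.
    @dataclass
    class BondMaster:
        isin: str
        issuer: str
        coupon_type: str      # fixed, floating, step-up, ...
        day_count: str        # 30/360, ACT/365, ...
        call_schedule: list   # each entry is itself structured data
        covenants: list

    # Depth: the same simple shape repeated over and over, en masse.
    @dataclass
    class IntradayQuote:
        isin: str
        timestamp: datetime
        bid: float
        ask: float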

Depth tends to be static with regard to structure. Some domains have entities of static breadth, but more often than not the breadth is dynamic: it changes on a regular basis, which is often the reason it has so much substructure in the first place. Dynamic breadth can be contained via abstraction, but most people find abstract relationships tricky to understand and visualize, so it isn't common practice. Instead, there is continual scope creep.

Since computers can conveniently augment the original data, systems usually have a fair amount of derived data. Sometimes this data is calculated by extending the relationships; sometimes it is just basic summary information. Derived data is more often calculated on-the-fly, usually to avoid staleness and inconsistency. Some derived data is composed entirely of calculations using raw data. Some derived data is built on other derived values. If you visualize the dependencies as a directed graph, all of the terminal nodes are always raw data. That is, what goes into the calculations must come from somewhere.
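
As a rough sketch of that dependency graph, here is some Python where one derived value is built on another and everything bottoms out in raw inputs; the positions and prices are made-up placeholders, not data from any real system.

    # Raw data: captured directly from the domain.
    positions = {"IBM": 100, "ACME": 250}     # shares held
    prices = {"IBM": 120.50, "ACME": 14.20}   # latest quotes

    # Derived data computed entirely from raw data.
    def market_values(positions, prices):
        return {sym: qty * prices[sym] for sym, qty in positions.items()}

    # Derived data built on other derived values; the terminal nodes of
    # the dependency graph are still the raw inputs above.
    def portfolio_total(positions, prices):
        return sum(market_values(positions, prices).values())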

The computations used are often called 'business logic', but some of the more esoteric ones really encompass the irrationality of historic conventions. Derived data is most often fast to calculate, but there are plenty of domains where it can take hours or even days. Those big calculations are usually stored in the persistence technology, and as such are sometimes confused with raw data.

Raw data is often self-explanatory in terms of its structure as stored persistently. Derived data, however, usually requires deeper explanation, and in most systems it is the predominant source of programming bugs. Not only do the programmers need to understand how it is constructed, but the users do as well. As a consequence, most derived-data calculations are either reasonably straightforward or they are created from a component built and maintained by external domain experts who are authoritative references.

In well-organized designs, the derived data calculations are all collected together in some consistent subsystem. This makes it easier to find and correct problems. If the calculations are scattered throughout the code base, then inconsistencies are common and testing is very expensive.

The underlying calculations can and should be implemented in a stateless manner. That is, all of the required data should be explicitly passed into the calculations. In some domains, since the underlying data triggers branches, it is not uncommon to see the code inverted. As the calculation progresses, the branches cause auxiliary data to be gathered: if the data is type A, then table A is loaded, and if it is B, then table B is loaded. This type of logic is easy to build up, but it complicates the control flow and error handling in the code, and it diminishes reusability.

It takes a bit of thinking, but virtually all derived calculations can be refactored to have the necessary data passed in as the initial inputs. That way, all valid rows from both table A and table B are input, no matter what the underlying type. Coded this way, the performance is usually better as well: in a big system, the branching won't gradually degenerate into too many sub-requests. Given this, derived data should always be implemented as a set of straightforward functions that are stateless and entirely independent from the rest of the system. There should be no side-effects, no global access and no need for external interactions. Just a black-box engine that takes data and returns derived results. Getting derived data fully encapsulated makes it really easy to extend and to deploy in a wide variety of circumstances. In the rare circumstance where this is not possible for some tricky special case, the rest of the derived data in the system should still be implemented correctly (one small special case does not excuse ignoring best practices).
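
A minimal sketch of the difference, assuming hypothetical tables A and B, a load_table() accessor standing in for the persistence layer, and a made-up settlement calculation:

    # Inverted style: the calculation reaches out for data as it branches.
    def settle_inverted(trade, load_table):
        if trade["type"] == "A":
            rates = load_table("A")   # hidden dependency on persistence
        else:
            rates = load_table("B")
        return trade["amount"] * rates[trade["currency"]]

    # Stateless style: every required row is passed in up front, so the
    # function is a black box with no side-effects or global access.
    def settle(trade, rates_a, rates_b):
        rates = rates_a if trade["type"] == "A" else rates_b
        return trade["amount"] * rates[trade["currency"]]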

The third type of data in a system is presentation data, but given the above definitions of raw and derived, initially it might not seem like there is a lot of data in this category. However, it is surprisingly large. Presentation is primarily about how we dress up data for the screen, but it also implicitly includes the navigational information about how the user got to the screen in the first place. Thus we have decorative data, like font choices and colors, but we also have things like session ids, the current primary data key and any of the options that the user has set or customized. In most professional systems that actually includes a lot of user, group and system preferences, as well as authentication and authorization. Basically, it's anything that isn't raw or derived from the raw data, which in modern systems is considerable.

Presentation data has often confused programmers, particularly with respect to patterns like model-view-controller (MVC). There is a tendency to think of the model as only the raw data, then implement the derived data as methods, and leave out the presentation data altogether. If done this way, the construction generally degenerates into a mess. The model should be the model for all of the data in the whole system, and act as a 1:1 mapping between main entities and objects. Thus the model would contain sets of objects for all three types of data, which would include, for example, an object that encapsulates the current navigational information for a given user session. It would also include any fiddly little switches on the screen that the user has selected.

In most cases, the objects in the model would be nouns, and their interactions would be equivalent to the way that the user describes what is happening at the interface level. The model, in this sense, is essentially object wrappers for any and all structural data within the system, at any time. In most cases, one can stop at the main entity level or one below; not all substructural data needs to be explicitly encapsulated with objects, but the depth should be consistent. Creating the full model for the system means there is a well-defined place to put code like validation and decoration. It also means that there is another 1:1 mapping, between the widgets in the UI and the model. This again keeps the code organized and clearly defines only one place for any new logic.
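
A sketch of what such a model might contain, using made-up healthcare-flavored names; the point is only that raw, derived and presentation data each get their own objects in the same model.

    from dataclasses import dataclass, field

    @dataclass
    class Patient:                  # raw data: a main entity
        patient_id: str
        name: str
        date_of_birth: str

    @dataclass
    class PatientSummary:           # derived data: computed from the raw
        patient_id: str
        age: int
        risk_score: float

    @dataclass
    class SessionState:             # presentation data
        session_id: str
        current_patient_id: str     # the current primary data key
        sort_order: str = "name"    # fiddly switches the user has set
        preferences: dict = field(default_factory=dict)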

An organized system is one where there is only one valid location for any of the code, and a well constructed system is where everything is exactly where it should be.

There are really only three types of data in a system: raw, derived and presentation. These percolate throughout the code. Most code in large systems is about moving the three types of data around between the interface and the persistence technology. If the three types are well understood, then very straightforward organizational schemes can be applied to keep them encapsulated and consistent. These schemes considerably reduce the amount of code needed, saving time and reducing testing and bugs. Most programs don't have to be as complicated as they have become; the extra complexity not only makes them hard to extend, it also degrades their performance. If we step back from the code, we are able to see these similarities and leverage them with simple abstractions and consistent rules. Minimal code is directly tied to strong organization; spaghetti code is always larger than necessary and usually disorganized on multiple different levels. If the data management is clean, the rest of the system cleanly falls into place.

Saturday, March 7, 2015

The Data

A very long time ago, I bought my very first -- brand spanking new -- computer. It was an XT clone with a turbo button. I joyfully set it up on the desk in my bedroom; I was very excited.

My grandmother came to look at my new toy. "What does it do?" she asked. I went on a rather long explanation about the various pieces of software I had already acquired: games, editors, etc.

"Sure, but what does it do?" she asked again.

At the time, I was barely able to afford the machine and the monitor, so there was no printer attached nor a modem and this was long before the days of the Internet. I tried to explain the software again, but she waved her hand and said "so it doesn't really do anything, does it?"

I was somewhat at a loss for words. It 'computed', and I could use that power to play games or make documents, but really, all on its own, it didn't actually do anything. It didn't print out documents, it didn't help with chores, and at the time it wasn't going to make me any money or improve my life in any way. She had a really valid point. Also, it was very expensive, having drained a significant amount of cash out of my bank account. My bubble had been burst.

I never forgot that conversation because it was so rooted in a full understanding of what a general purpose computer really is. It 'computes' and on its own, that is all it really does. Just hums along quietly manipulating internal symbols and storing the results in memory or on a disk. For long, tedious computations it is quite a useful time saver, but computation, all by itself, doesn't have any direct impact on our world.

Then the mid-90s unexpectedly tossed my profession out of the shadows. Before the Internet explosion, if I explained what I did for a living people would give me pitying looks. After, when computers leaped into the mainstream of everyone's lives, they seemed genuinely interested. It was quite the U-turn.

Still, for all of the hype, computers didn't fully emerge on the landscape until social networking came along. Before that, for most people they were mere playthings stored in an unused bedroom and fiddled with occasionally for fun. What changed so radically?

What people see when they go online is certainly not the code that is running in the background. And they really only see the interface if it is hideous or awkward. What they see is the data. The stuff from their friends, the events of the world, things for sale and the stories and documents that people have put together to entertain or explain. It's the data that drives it all. It's the data that has value. It's the data that draws them back, over and over again. What computers do these days is collect massive amounts of data and display it in a large variety of different ways. You can search, gain access, then explore how it all relates back to itself.

The best loved and most useful programs are the ones that provide seamless navigation through large sets of structured data. They allow a user to see all of what's collected, and to explore its relationships. They often augment these views with other 'computed' data derived from the raw underlying stuff, but mostly that's just decoration. Raw data is what drives most sites, applications, etc.

Given the large prominence of data within our modern applications, it is somewhat surprising that most programmers still see programming as assembling 'lists' of instructions. That's probably because the first skills they learn are branching and looping. That gradually leads them to algorithms, and by the time they are introduced to data-structures they are somewhat overwhelmed by it all.

That all-too-common path through the knowledge leaves most people hyper-focused on building up ever more code, more lists of instructions, even if that's only a secondary aspect of the software.

A code-specific perspective of programming also introduces a secondary problem. That is, variables are seen as independent things. One creates and manipulates a number of different variables to hold the data. They may be 'related', but they are still treated as independent entities.

In sharp contrast, the user's perspective of a program is completely different. What they see is the system building up ever more complicated structures of closely related data that they can navigate around. Variables aren't independent; rather, they're just the lowest rung that holds a small part of the complex structure. The value comes from the structure, not the underlying pieces themselves. That is, a comment on a blog includes all of the text, the user name and link, the date and time when it was published and possibly some ranking. These are not independent things; they all exist together as one unit. But we also need the whole stream of comments, plus the original post, to fully understand what was said. The value of a comment isn't a single block of text, it is the whole complex data-structure surrounding a particular comment on a blog entry somewhere. Individual data points are useless if they are not placed in their appropriate context.
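
The blog-comment example might look something like this in Python; the exact fields are assumptions, and the point is that a comment only carries its full value as part of the surrounding structure.

    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class Comment:
        text: str
        user_name: str
        user_link: str
        published: datetime
        ranking: int = 0

    @dataclass
    class BlogPost:
        title: str
        body: str
        published: datetime
        comments: list = field(default_factory=list)  # the whole stream, in order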

There may be a huge amount of underlying code involved in getting a blog entry and its comments out onto a web page, but from the functionality standpoint it is far simpler to just think of it as a means to shift complex data-structures from somewhere in persistent storage to a visible presentation. And with that data-flow perspective, we can quickly get a solid understanding of any large complex system, particularly if in the end it only really displays a small number of different entities. The data may flow around in a variety of different ways, but most of that code is really just there to convert it into alternative structures as it winds its way through the system. The minimal inherent complexity of a system comes directly from the breadth of the complex data-structures that it supports.

Obviously, if you wanted to heavily optimize a big system, reducing the number of variables and intermediate representations computed along the main access pathways would result in a significant reduction in bloat. Not quite as obviously, it would also make the code far more readable, in that the programmers would not have to understand or memorize nearly as many different formats or manipulations while working on it. The most efficient code reduces the effort required to move data around, and also reuses any intermediate calculations as appropriate.

That, in its own way, leads to a very clear definition of elegance. We will always have to live with the awkwardness of any of the language constructs used in the code, but looking past those, a really elegant program is one that is organized so that the absolute minimal effort is used to move the data from persistence to presentation, and back again. In that sense, if you have a large number of complex data entities, they are slurped up from persistence into an internal 'model' that encapsulates their structure. From there they are transported, as is, to their destination, where they are dressed up for display. If, in between, there isn't any fiddling or weird logic, then the work has been minimized. If the naming convention is simple, readable and consistent, and any algorithmic enhancements (derived data from the raw stuff) are all collected together, then the code is masterfully done. Not only is it elegant, but in being so, it is also fast and very easy to enhance if you continue to follow its organization.
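
Expressed as a sketch, that minimal path might look like this; fetch_rows, to_model and render are hypothetical stand-ins for the persistence, model and presentation layers.

    # A deliberately thin pipeline: the data moves from persistence to
    # presentation with no fiddling or weird logic in between.
    def show_entity(entity_id, fetch_rows, to_model, render):
        rows = fetch_rows(entity_id)   # persistence: slurp up the stored structure
        model = to_model(rows)         # model: encapsulate it, as is
        return render(model)           # presentation: dress it up for display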

Viewing a large system as huge lists of instructions is complicated. Viewing it as various data-structures flowing between different end-points is not. In fact, it leads to being able to encapsulate the different parts of the system away from each other very naturally. We can see, for instance, that reading and writing data to persistent storage stands on its own. Internally, we know that we can't keep all of the data from the database in memory at once, so deciding how much is necessary for efficiency is important. We can split off the actual presentation of any data from its own inherent categorization. We can also clearly distinguish between raw data and the stuff that needs to be derived. This allows us to isolate larger primitives, lego blocks that can be reused throughout the code, simply because they are directly involved with manipulating larger and more complex structures. As we build up the structures, we pair them with suitable presentation mappings. And most of this can be driven dynamically.

In a very real sense, the architecture and the bottom up construction of the fundamental blocks comes directly from seeing the system as nothing more than a flow of data. That this also matches the way the users see and interact with the system is an added bonus.

Data really is the defining fundamental thing in building any system. The analysis and design start out with finding the data that is necessary to solve a specific problem. Persistence is the foundation for all of the work built on top, since it sets the structure of the data. The architecture is rooted in organizing all of the moving parts for the data-flow. The code, inspired both by the history of ADTs and by Object Orientation, can be encapsulated alongside its specific data, and anything that does not fit elsewhere, like derived data, can be collected together so that it is easily found and understood. Data is the anchor for all of the other parts of development.

Data drives computers; without it they are just general purpose tools with no specific purpose. In so many ways my grandmother was correct, but what she and I both failed to see all those decades ago was that as soon as we started filling up these machines with enough relevant data, they did suddenly become very, very useful. Code is nice, but it's not really what's important in software. It's just the means to access the data. And for our young industry, it's definitely the case that we have only just begun to learn how to construct systems; once we've mastered complexity and can manipulate large amounts of data reliably, computers will transform our societies in all sorts of ways that are currently unimaginable.