Saturday, March 7, 2015

The Data

A very long time ago, I bought my very first -- brand spanking new -- computer. It was an XT clone with a turbo button. I joyfully set it up on the desk in my bedroom; I was very excited.

My grandmother came to look at my new toy. "What does it do?" she asked. I went on a rather long explanation about the various pieces of software I had already acquired: games, editors, etc.

"Sure, but what does it do?" she asked again.

At the time, I was barely able to afford the machine and the monitor, so there was no printer attached nor a modem and this was long before the days of the Internet. I tried to explain the software again, but she waved her hand and said "so it doesn't really do anything, does it?"

I was somewhat at a loss for words. It 'computed', and I could use that power to play games or make documents, but really, all on its own, it didn't actually do anything. It didn't print out documents, it didn't help with chores, and at the time it wasn't going to make me any money or improve my life in any way. She had a really valid point. Also, it was very expensive, having drained a significant amount of cash out of my bank account. My bubble had been burst.

I never forgot that conversation because it was so rooted in a full understanding of what a general purpose computer really is. It 'computes' and on its own, that is all it really does. Just hums along quietly manipulating internal symbols and storing the results in memory or on a disk. For long, tedious computations it is quite a useful time saver, but computation, all by itself, doesn't have any direct impact on our world.

Then the mid-90s unexpectedly tossed my profession out of the shadows. Before the Internet explosion, if I explained what I did for a living people would give me pitying looks. After, when computers leaped into the mainstream of everyone's lives, they seemed genuinely interested. It was quite the U-turn.

Still, for all of the hype, computers didn't fully emerge on the landscape until social networking came along. Before that, for most people they were mere playthings stored in an unused bedroom and fiddled with occasionally for fun. What changed so radically?

What people see when they go online is certainly not the code that is running in the background. And they really only see the interface if it is hideous or awkward. What they see is the data. The stuff from their friends, the events of the world, things for sale and the stories and documents that people have put together to entertain or explain. It's the data that drives it all. It's the data that has value. It's the data that draws them back, over and over again. What computers do these days is collect massive amounts of data and display it in a large variety of different ways. You can search, gain access, then explore how it all relates back to itself.

The best loved and most useful programs are the ones that provide seamless navigation through large sets of structured data. They allow a user to see all of what's collected, and to explore its relationships. They often augment these views with other 'computed' data derived from the raw underlying stuff, but mostly that's just decoration. Raw data is what drives most sites, applications, etc.

Given the prominence of data within our modern applications, it is somewhat surprising that most programmers still see programming as assembling 'lists' of instructions. That's probably because the first skills they learn are coping with branches and loops. That gradually leads them to algorithms, and by the time they are introduced to data structures they are somewhat overwhelmed by it all.

That all too common path through the knowledge leaves most people hyper-focused on building up ever more code, more lists of instructions, even if that's only a secondary aspect of the software.

A code-specific perspective on programming also introduces a secondary problem: variables are seen as independent things. One creates and manipulates a number of different variables to hold the data. They may be 'related', but they are still treated as independent entities.

In sharp contrast, the user's perspective of a program is completely different. What they see is the system building up ever more complicated structures of closely related data that they can navigate around. Variables aren't independent; rather, they're just the lowest rung that holds a small part of the complex structure. The value comes from the structure, not the underlying pieces themselves. That is, a comment on a blog includes all of the text, the user name and link, the date and time when it was published and possibly some ranking. These are not independent things, they all exist together as one unit. But we also need the whole stream of comments, plus the original post, to fully understand what was said. The value of a comment isn't a single block of text, it is the whole complex data-structure surrounding a particular comment on a blog entry somewhere. Individual data points are useless if they are not placed in their appropriate context.
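The blog-comment example can be sketched as a small data structure. This is just an illustration in Python; all of the names (Comment, BlogPost and their fields) are invented for the sketch:

```python
from dataclasses import dataclass, field
from datetime import datetime

# Hypothetical types for the blog-comment example: the value lives in
# the whole structure, not in any one field on its own.
@dataclass
class Comment:
    text: str
    user_name: str
    user_link: str
    published: datetime
    ranking: int = 0

@dataclass
class BlogPost:
    title: str
    body: str
    comments: list = field(default_factory=list)  # the whole stream of comments

post = BlogPost(title="The Data", body="...")
post.comments.append(
    Comment(text="Great point!", user_name="alice",
            user_link="https://example.com/alice",
            published=datetime(2015, 3, 7, 10, 30)))

# A comment only makes sense in context: through its post we can reach
# the original entry and every sibling comment.
print(post.comments[0].user_name)
```

None of the fields here are independent variables; they only carry meaning as parts of the surrounding structure.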

There may be a huge amount of underlying code involved in getting a blog entry and its comments out onto a web page, but from the functionality standpoint it is far simpler to just think of it as a means to shift complex data-structures from somewhere in persistent storage to a visible presentation. And with that data-flow perspective, we can quickly get a solid understanding of any large complex system, particularly if in the end it only really displays a small number of different entities. The data may flow around in a variety of different ways, but most of that code is really just there to convert it into alternative structures as it winds its way through the system. The minimal inherent complexity of a system comes directly from the breadth of the complex data-structures that it supports.

Obviously, if you wanted to heavily optimize a big system, reducing the number of variables and intermediate representations computed along the main access pathways would result in significant bloat reduction. And, not quite as obviously, it would also make the code far more readable, in that the programmers would not have to understand or memorize nearly as many different formats or manipulations while working on it. The most efficient code reduces the effort required to move data around, and also reuses any intermediate calculations as appropriate.

That in its own way leads to a very clear definition of elegance. We will always have to live with the awkwardness of whatever language constructs are used in the code, but looking past those, a really elegant program is one that is organized so that the absolute minimal effort is used to move the data from persistence to presentation, and back again. In that sense, if you have a large number of complex data entities, they are slurped up from persistence into an internal 'model' that encapsulates their structure. From there they are transported, as is, to their destination, where they are dressed up for display. If in between there isn't any fiddling or weird logic, then the work has been minimized. If the naming convention is simple, readable and consistent, and any algorithmic enhancements (derived data from the raw stuff) are all collected together, then the code is masterfully done. Not only is it elegant, but in being so, it is also fast and very easy to enhance if you continue to follow its organization.
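That persistence-to-presentation flow can be sketched in a few lines. This is a minimal illustration, assuming a dictionary stands in for storage and an HTML string for the display; every name here is invented:

```python
# A minimal sketch of the persistence -> model -> presentation flow:
# slurp the entity up, transport it as is, and dress it up only at the end.

def load_entry(storage: dict, entry_id: str) -> dict:
    """Pull the entity out of 'persistence' into an internal model."""
    return storage[entry_id]  # transported as is, no fiddling in between

def render(entry: dict) -> str:
    """Dress the model up for display only at the very last step."""
    return f"<h1>{entry['title']}</h1><p>{entry['body']}</p>"

storage = {"post-1": {"title": "The Data", "body": "A very long time ago..."}}
html = render(load_entry(storage, "post-1"))
print(html)
```

The point of the sketch is what is absent: there are no intermediate reshufflings between loading and rendering, which is exactly where the minimal-effort definition of elegance comes from.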

Viewing a large system as huge lists of instructions is complicated. Viewing it as various data-structures flowing between different end-points is not. In fact it leads to being able to encapsulate the different parts of the system away from each other very naturally. We can see, for instance, that reading and writing data to persistent storage stands on its own. Internally we know that we can't keep all of the data from the database in memory at once, so setting how much is necessary for efficiency is important. We can split off the actual presentation of any data from its own inherent categorization. We can also clearly distinguish between raw data and the stuff that needs to be derived. This allows us to isolate larger primitives, Lego blocks that can be reused throughout the code, simply because they are directly involved with manipulating larger and more complex structures. As we build up the structures, we pair them with suitable presentation mappings. And most of this can be driven dynamically.

In a very real sense, the architecture and the bottom up construction of the fundamental blocks comes directly from seeing the system as nothing more than a flow of data. That this also matches the way the users see and interact with the system is an added bonus.

Data really is the defining fundamental thing in building any system. The analysis and design start out with finding the data that is necessary to solve a specific problem. Persistence is the foundation for all of the work built on top, since it sets the structure of the data. The architecture is rooted in organizing all of the moving parts for the data-flow. The code, inspired both by the history of ADTs and by object orientation, can be encapsulated alongside its specific data, and any of the stuff that does not fit elsewhere, like derived data, can be collected together so that it is easily found and understood. Data is the anchor for all of the other parts of development.

Data drives computers; without it they are just general purpose tools with no specific purpose. In so many ways my grandmother was correct, but what she and I both failed to see all those decades ago was that as soon as we started filling up these machines with enough relevant data, they did suddenly become very, very useful. Code is nice, but it's not really what's important in software. It's just the means to access the data. And for our young industry, it's definitely the case that we have only just begun to learn how to construct systems; once we've mastered complexity and can manipulate large amounts of data reliably, computers will transform our societies in all sorts of ways that are currently unimaginable.