Sunday, January 4, 2009

Digging into Data

It was a simple description that really affected me. The kind of thing that sinks deep into your core understanding, causing ripples as it descends.

I was reading "A Short History of Nearly Everything" by Bill Bryson, an excellent, informative and entertaining book. He was talking about the number of cells in the human body versus the number of bacteria and other independent single-celled life-forms that travel with us. Shockingly, they outnumber our own cells 10 to 1*, as they inhabit all corners of our bodies.

* "Every human body consists of about 10 quadrillion cells, but about 100 quadrillion bacterial cells" (p. 303)

If we mix that with the understanding that common household dust is mostly made up of human skin tissue, and another fact from the book that our internal matter is possibly renewed every nine years*, then you get a fascinating perspective on your typical human being. We are, it seems, nothing more than clouds of cells -- both dependent and independent -- leaving remnants of ourselves in our wake.

* "Indeed, it has been suggested that there isn't a single bit of any of us -- not so much as a stray molecule -- that was part of us nine years ago" (p. 373)

Wherever we go, we leave some of ourselves behind, gradually spreading our overall existence thin. In a real sense, even though we see people as clearly defined creatures, a statistical view that just tracks where the bulk of a person is at any particular moment is a more realistic model. You are, to some diminishing degree, wherever you've been. If you travel the world, bits of you remain there nearly forever.

We see ourselves as these discrete entities, completely contained, and certainly all encompassing, but our own perspective is skewed by our desires to make it easier to understand our own existence. The world around us is nothing more than a giant mess of infinitesimally small little bits interacting with each other in highly complex ways. To make this easier, we fit a perspective over all of this that collects the pieces together into a tiny number of smaller constrained 'things'. People, plants, animals, rocks, water are all viewed as static independent elements, even though we clearly understand that they are connected in some way.

That the underlying layer is made up of a nearly endless level of complexity that we are trying to ignore is significant in our coming to terms with the world around us. We go at the higher level as if it were the only one, surprised when some other interaction becomes significant, but the true sense of understanding comes from realizing that the way we clump things is entirely arbitrary.

Yes, history and science have driven our modern perspectives, another great point made by Bill Bryson's book, but the way we have labeled and categorized the world around us was not driven by some long-term rational plan. It was driven gradually by experimentation and luck, and that itself shows deep down in the fragments of our knowledge. Our foundation for clumping together the discrete things is an irrational one at best. We can forgive that -- it is a by-product of the world we inhabit -- but we need to take it into account if we are trying to get a real grip on what we know.


DEFINITIONS

Before I wind my way too deep into this rather odd discussion, I should start with a few simple definitions that will work for the remainder of this post.

Information -- at least in the way I use the term -- is some amount of data relative to a particular subject. The term often denotes some underlying understanding, but generally I prefer to assign that to the higher-level term 'knowledge'. Information doesn't have to be sensible or correct; it really is just a way of distinguishing the substance of some underlying differentiation. Data, then, is just the simple points that make up information. In that way, the definition is harmonious with Claude Shannon's Information Theory.

Data is a senseless stream of individual data points, each one a specific fact, usually about something. With structure, data becomes information, but information alone tells us nothing of the world around us. If you could imagine trying to read a novel full of nothing but dry, unencumbered facts about the world, without context, all of that information would not necessarily help to build knowledge.

The data supports the information, but once we distill something usable from that it becomes knowledge that we can manipulate. Knowledge, in the end, is what we are really after. Knowledge comes from our ability to make rational decisions based on information.

Together these three definitions allow us to say interesting sentences like: "we are collecting data, and from that information we are attempting to mine some knowledge." Data, information and knowledge define three increasing levels, each one providing significantly more value to us in our lives.

I know these are rather circular and pedantic definitions, but as I will be whipping around these three terms frequently in this post, I can't leave the subtleties of their definitions up to the reader or I'll get a lot of weird comments misinterpreting my statements.


OUR WORLD

Our world is made up of a mostly discrete view that we layer over the continuous world around us. Way underneath -- in an odd twist of fate -- we also think the world is discrete, but that depth is well out of the range of this discussion.

In fact, we needn't go too deep into our underlying nature. We can pick some completely arbitrary level of 'particles', be that classical atoms, electrons and photons, quarks, or any other low decomposition layer; it won't make a difference. For this discussion we don't need to be that precise.

Particles make up the world around us, organized into at least three physical dimensions and a more difficult temporal one. All of these dimensions are directly observable, so over time they've molded our perspective.

Not surprisingly, the physical world influences our language and the way we communicate with each other. Nouns, for instance, we use to relate to physical things that exist within the main three dimensions, and verbs we use to describe time-based issues. The structure of our language itself reflects our perspective.

The four dimensions of our existence actually limit our communication. We can talk about higher dimensions, five, six, etc., but only as abstract ideas, and only specifically through the language of mathematics. We have no common terms for fifth-dimension objects, for example.

The three physical dimensions are rather obvious; it's time that generally causes all of the trouble. We can see that fourth dimension in terms of the other three, but to make it work smoothly we need the rather odd perspective of thinking of our future selves as being different from us now. Two things separated by time then become the same as two things separated by distance. Time becomes a measure indistinguishable from distance. If you consider the initial description about our bodies recycling every nine years, that's not as conceptually hard to imagine as we might suspect.

Besides the physicality of our world, the way we describe things is based on how, over the years, we have come to break it down into various pieces. The knowledge that we've built up has come from a massive number of discoveries, experiments, observations, theories and ideas, all accumulated over thousands of years. So much of our understanding is based on events in the past that we are constantly forgetting where and why it happened.

Radical new ideas gradually become common knowledge. They assimilate into the popular understanding, becoming obvious. Simple things like negative numbers, for example, weren't always around, and to someone living seven hundred years ago they would have seemed like a highly weird concept.

Together our physical existence, and our history define the boundaries of our communication. And our communication defines the boundaries of our knowledge.


DECOMPOSITION

The world around us may be composed of a massive number of tiny particles, but to match our observations, and to make things easier we perceive all of these little bits as larger discrete entities.

So that we can refer to them, we've named most of the things we've encountered. Mostly the nouns we use are solid physical things, or specific places, but gradually over time some abstract ones such as 'money' have crept into our perspective; they don't physically exist, but we pretend that they do.

Some things only have 'type' names, such as trees or bushes, while others hold specific 'instance' names as well, such as people or cities. Because so many of these instances share some multiple sub-similarities, we like to categorize everything into broader groups of related things.

For some things we group them by common patterns. All things made of wood, are wooden for example, while those of steel or gold are metal. For some things we break them down into arbitrary decompositions. People, animals and furniture have legs for example. Birds and planes have wings.

Both arbitrary patterns of particles and arbitrary decompositions drive our group associations.

The patterns and decompositions for all things are themselves broken down into more patterns and decompositions. Sometimes nicely in an organized hierarchy, sometimes in messy overlapping ways that are not easily mapped against each other. In that way we create multiple intricate overlapping ways to reduce and understand things, some that at times can be at complete odds with each other. Famously, the wave versus particle views of light provide two contradictory decompositions, both of which appear to be correct.

So we live in a world where all things are made up of an enormous number of infinitesimally small bits that we have arbitrarily clumped into many overlapping subsets and matched back to other patterns and other subsets, in order to form descriptive language to communicate these pieces and how they interact.

As things happen in time to our various nouns, we use a huge number of verbs to indicate their behavior. We have a wide array of choices for how to describe what is happening. Many verbs themselves contain elements of three-dimensional structure, but are still tied to a temporal perspective. Running, for example, in reference to humans means specific movements of arms and legs, while in reference to a car it means specific movements of parts of the engine.

In those cases where we do not have a very specific type or instance name, we can always construct a unique (enough) reference out of a combination of specific sub-pieces. We can put together enough pieces to be able to communicate something specific, such as a caution about the man in the red shirt standing by the corner store. Each piece is ambiguous, but the combination is precise enough to communicate something.
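To make the "man in the red shirt" idea a little more concrete, here is a minimal sketch in Python. The Person class, its fields and the matching helper are all hypothetical, invented only for illustration. Each attribute on its own matches several people, while the combination narrows things down to one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    # Hypothetical attributes; none of them is unique by itself.
    kind: str
    shirt: str
    location: str


def matching(people, **attrs):
    """Return everyone whose attributes equal all of the given values."""
    return [p for p in people
            if all(getattr(p, name) == value for name, value in attrs.items())]


crowd = [
    Person("man", "red", "corner store"),
    Person("man", "blue", "corner store"),
    Person("woman", "red", "corner store"),
    Person("man", "red", "bus stop"),
]

# One ambiguous piece: several people are wearing red.
red_shirts = matching(crowd, shirt="red")

# Enough pieces combined: a unique (enough) reference.
the_one = matching(crowd, kind="man", shirt="red", location="corner store")
```

Here red_shirts holds three people while the_one holds exactly one, which is all that "precise enough" needs to mean.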

From our various representations of physical things and events to each other, we come to build up a significant amount of knowledge that allows us to manipulate the world around us to a higher degree than all of the other species on our planet. More plainly: we can build stuff that works, because we have a reasonable understanding of why it should.


AN ANCHOR

What's interesting about all of this is that within the last century we've figured out how to build machines that can help us manipulate our information itself, hopefully leading us to a place of higher knowledge. At some point, which we've likely crossed long ago, the information we've collected about our world exceeded each individual's ability to turn it into useful knowledge. No one single human can know it all; it is beyond our capabilities.

Of course, in the past, knowledge was as easily dropped as it was passed on to the next set of minds, so progress was slow and painful. Books made a huge difference in preserving it, and computers offer an even bigger improvement. They offer us the power to deal with massive amounts of information, but first we have to come to terms with what we know, before we can utilize them properly. The machines exist, and there is lots of theory about them, but we fail to grasp how to map our real existence into the computer accurately.

We go at data as if it is entirely discrete, loosely-linked, independent points that we can mix and match with ease. We map our desired discrete world around us, onto a simplified discrete one in the machine, but then we are surprised when that mapping fails.

Instead of the oversimplified static relationships that we are currently using, we have to come back to viewing data for what it really is underneath.

At some low-level, the world is made up of a nearly infinite number of particles that can interact with each other in a nearly infinite number of ways. In fact the permutations are so vast and large and complex that nothing short of an actual universe can simulate them to any truly accurate degree. That is, the only way to get a perfect model of the universe, is for that model to be an exact copy. No lesser amount of information will even be correct enough, eventually drifting away from reality at some predictable pace.

But those inherent inaccuracies need not be significant at our scale in the universe. We don't need to take quantum effects into account when simulating the physics of a billiard (pool) table for instance.

What is important isn't necessarily the precision of the data, but the fact that there can be a relationship between any two particles that may be of interest. Now, mostly there is not, as most of the particles are separated from each other by other particles or vast distances, yet over time there is always some increasing probability that there could be some relationship to bind the two together.

It would be impractical to talk about storing all of the particles on a computer for everything that we want to track in the world around us. That's far too much data, and as the relationships are mostly proximity based, it is sparse information at best. We needn't waste that many resources, at least not if it isn't providing some tangible amount of knowledge.

What we do want to do is to track large subsets of these particles, and to them we want to assign some symbolic tokens to represent them in the computer. I, for instance, might exist within an accounting system as a 'user', or a 'vendor', or some other role significant to the underlying data store. My essence is in the real world; the computer just holds some key facts about me. The more information it collects, the more I can utilize that information to make some positive effect on my life in the real world.

More importantly, for all of the things we collect, we want to be able to easily tie them back to any other arbitrary subset or decomposition, in the same way we do in the real world.

And we won't know any of these relationships in advance. If we knew the complete importance of any piece of information we were collecting, then we probably would not need to collect it. The usefulness of the data is derived from the fact that it is not just some static fact stuck in a pre-defined schema that never changes. That only works as a model for the most limited of 'accounting' systems.
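As a rough illustration of what "not knowing the relationships in advance" might look like in code, here is a small Python sketch. The LinkStore class, the token names and the relationship kinds are all hypothetical, and treating every link as symmetric is a simplifying assumption of mine. It is a sketch of the idea, not a recommendation for any particular storage technology.

```python
from collections import defaultdict


class LinkStore:
    """Holds links between symbolic tokens without a predefined schema."""

    def __init__(self):
        # token -> set of (kind, other_token) pairs
        self._links = defaultdict(set)

    def relate(self, a, kind, b):
        # Assumption: links are symmetric, so record both directions.
        self._links[a].add((kind, b))
        self._links[b].add((kind, a))

    def related(self, token, kind=None):
        """Tokens linked to `token`, optionally limited to one kind of link."""
        return {other for k, other in self._links[token]
                if kind is None or k == kind}


store = LinkStore()
store.relate("person:1", "vendor", "company:acme")

# A kind of relationship nobody planned for can be added later,
# without restructuring what is already stored.
store.relate("person:1", "neighbour", "person:2")

everything = store.related("person:1")
only_vendor = store.related("person:1", kind="vendor")
```

The point is only that a new kind of relationship arrives as more data, rather than as a change to the structure holding the data.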


THROUGH THE LOOKING GLASS

All of this is a rather messy yet still somewhat conventional way of looking at the world and data. Still, it explains why software development has so many current problems.

We know that at the lowest level there is always some chance that there might be a relationship between any two particles. Moving this upwards, there is some minute, yet greater, chance that there is some trackable relationship between the symbolic representations of any two things in the computer. For any given data in the system, it may someday be necessary to tie it to any other data in the system.

In easier language, there is a nearly infinite set of possible relationships for each and every piece of data in the system. Limiting that to a fixed subset will fail.

When we do something like choose a relational schema for a database we are shedding the possibility of all relationships, only to concentrate on a small subset. Unless one is particularly objective, knowledgeable and precognitive, many of the required relationships between the different entities will not make it into the schema. That would require too much up-front knowledge about the elasticity of the data set, information that is only gradually available over time.

In truth, no matter how diligent you are, a fixed set of static relationships will never be enough relationships to model the world around us. Particularly as the system grows, we need to decompose the data into more detail, but we also need to handle more meta-relationships as well. The system opens up a new understanding, which itself changes the way we view the data. It's a classic chicken and egg problem.

Most database technology and practice, on the other hand, is founded on the idea that the world is static: that once the schema is set up, it is unlikely to change, or at least not particularly difficult to change. But in practice it is extremely hard to fix schema problems, making upgrades risky, complex and hugely resource intensive.

This is also compounded by the fact that the 'upgrades' themselves from one view of the data to the next are always an afterthought for most software development teams. It's funny how the hardest problem in software -- going to the next version of the system -- is always the last one in consideration, isn't it?
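One modest way to stop upgrades from being an afterthought is to keep them as an ordered, versioned list from the very first release. The Python sketch below uses the standard sqlite3 module and SQLite's user_version pragma. The customer table and both migration statements are hypothetical examples of mine, and a real system would also need data conversion steps, backups and a rollback plan.

```python
import sqlite3

# Hypothetical schema history: each entry moves the database up one version.
MIGRATIONS = [
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
    "ALTER TABLE customer ADD COLUMN email TEXT",
]


def upgrade(conn):
    """Apply any migrations the database has not seen yet; return its version."""
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    pending = MIGRATIONS[current:]
    for version, statement in enumerate(pending, start=current + 1):
        conn.execute(statement)
        # Record progress so a later run resumes after this step.
        conn.execute(f"PRAGMA user_version = {version}")
    conn.commit()
    return conn.execute("PRAGMA user_version").fetchone()[0]


conn = sqlite3.connect(":memory:")
version = upgrade(conn)
columns = [row[1] for row in conn.execute("PRAGMA table_info(customer)")]
```

Running upgrade a second time applies nothing, since the stored version already matches the length of the list. This does not make schema changes easy, but it does make the path from one view of the data to the next an explicit part of the system.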

The success of our computer systems is clearly bounded by the amount and quality of the data we collect. Failure is an inability of any kind to utilize that effort for something in the real world. If we can't turn the data into information, and turn that information into knowledge, then we have wasted the resources that went into the software development and operations.

The information in a computer system is composed of data bound by various changing relationships, some strong and explicit, while others are weak and implicit. The explicit ones are actual linkages at some point or another, whether pointers, data structures or other technical contrivances. Implicit relationships are ones that can be made deterministically by some limited amount of code. There are also heuristically obtainable implicit relationships, and those that are pure guesses, but it's enough to know that the link between the data is not static.
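The difference between explicit and implicit relationships can be shown with a small Python sketch. The Customer class, its fields and the same_area rule are hypothetical, chosen only to illustrate the distinction. The referred_by field is an explicit link stored in the data, while same_area is an implicit relationship that a small amount of code derives deterministically.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Customer:
    name: str
    postal_code: str
    # Explicit relationship: a stored reference to another record.
    referred_by: Optional["Customer"] = None


def same_area(a, b):
    """Implicit relationship: nothing is stored, it is computed on demand.

    Assumption: the first three characters of the postal code identify an area.
    """
    return a.postal_code[:3] == b.postal_code[:3]


alice = Customer("Alice", "M5V 2T6")
bob = Customer("Bob", "M5V 1J1", referred_by=alice)
carol = Customer("Carol", "K1A 0B1")

explicit_link = bob.referred_by is alice
implicit_near = same_area(alice, bob)
implicit_far = same_area(alice, carol)
```

Nothing in the records says that Alice and Bob are neighbours, yet the relationship is there for any code that cares to look for it.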

The number and types of these internal relationships are essentially infinite, or at least as close to that as necessary. There is no real world limit on how particles interact, so there should be no symbolic ones either. Preserving this inside of the computer is key to matching that data back to the outside world.


COMPLEXITY AND ITS COUSINS

If you see data as symbolic tokens, representing points in the real world, with a limitless number of relationships to the other tokens in the system, then it isn't hard to see why modern software systems are failing so often.

The designers narrowly focus on some small subset of data, and some small subset of relationships. They take both of these, and build around them a lot of complex, yet fragile strings of instructions: functionality. At some expected point, someone, somewhere notices that at least one point of data, or a relationship, isn't correct or deep enough, but in a misguided effort to avoid losing information, the programmers attempt to bolt the new structured data onto the side, so as not to interfere with the existing system.

Of course, the side-mounting technique simplifies the updates, but it places some artificial complexity into the system in terms of relating the new data back to the old data. This is complexity that is created explicitly and solely for the purpose of binding together the two sources of data. It is complexity that is 100% arbitrary, and has no connection back to the real world.
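A tiny Python sketch of side mounting, using hypothetical customer and phone data of my own invention. The original records hold a single phone number each. When more numbers are needed, they are bolted onto a separate structure keyed a different way, and glue code appears whose only job is to bind the two together.

```python
# Original model: one phone number per customer, keyed by id.
customers = {1: {"name": "Acme Ltd", "phone": "555-0100"}}

# Bolted on later: extra numbers, keyed by upper-cased name rather than id.
extra_phones = {"ACME LTD": ["555-0101", "555-0102"]}


def all_phones(customer_id):
    """Collect every phone number for a customer across both stores."""
    record = customers[customer_id]
    phones = [record["phone"]]
    # Glue: this lookup exists only to reconcile the two structures.
    # It models nothing in the real world.
    phones += extra_phones.get(record["name"].upper(), [])
    return phones


# Fixing the model instead: a customer simply has a list of phones.
fixed_customers = {
    1: {"name": "Acme Ltd", "phones": ["555-0100", "555-0101", "555-0102"]},
}

patched_view = all_phones(1)
fixed_view = fixed_customers[1]["phones"]
```

Both views return the same three numbers, but the patched version now depends on names staying unique and on their casing being consistent, neither of which the real world required.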

As such, each and every time the developers seek to extend their system, by preserving some amount of nearly correct information, they slide a little farther down the slope of uncontrolled complexity. If, as is sometimes the case, the original design is sloppy and imprecise, a reasonably small initial system can find its way into an enormous ball of unconstrained complexity in a very short period of time. Unseen and uncontrolled, these types of projects fail, or just become tar pits until they are rebuilt.


CONTROLLING CHAOS

Software projects are saved by understanding the right answer versus a workable answer. In the short term, until we have technology that really helps us with data, developers need to realize that their own limited understanding of the relationships between the data they are trying to capture is, in fact, a limited understanding of the relationships between the data. Hubris, as an initial problem, weighs down projects to dangerous levels.

Even with the most obvious data, there can be things that come along and cause trouble. Accepting this is critical. An architecture based around any static relationships between the major data entities in the system is a fragile architecture. You always have to believe that there could exist some irrational, but necessary, relationship in the future that needs to be captured as part of the system. Crazy, yes, but most veteran programmers can tell lots of stories about ugly irrational data.

The most important point for any piece of software development is to be able to fix the real problems before they blow up in complexity. If the underlying data model is wrong, then any type of band-aid fix will jeopardize the software. This is always true, and it will always be true.

In some circumstances, a quick hack may make the most appropriate business sense, but only if it is also paired with the work to properly fix the system. If you pull down the pillars in a building that are holding the weight, it will collapse. If you badly patch the data in a system, at some point it too will lead to something bad. Always, guaranteed. It's only a matter of when.

Looking forward to the harder support and update issues is a key part of handling the design of a large and complex system. At each point the software developers need to understand the "right" answer, and make decisions about why and how they can apply that to the system. Some systems start a huge distance away from their final incarnations, so it isn't always an easy choice. Little by little, piece by piece, the problems are corrected and the data model is made better and more reliable.

Oddly, younger, less experienced programmers focus heavily on the algorithms or coding optimizations, but both of those are relatively easy problems to fix in practice. You can swap out a weird behavior easily, but getting rid of any persistent data associated with it is a huge problem.


FINAL NOTES

The industry perspective is on creating code. Our technologies have been built around simple, stupid static data that is easily manipulated. A computer however, is really just a device to help people pile up large amounts of data for analysis. Code helps with this task, but the importance of a computer comes from the data itself.

Data is the heart and soul of the machines, yet it is a secondary thought for most programmers. Its structure is far more complex than most people realize, and it is very easy to underestimate its complexity. Programmers are great at assuming there are a limited number of fixed relationships of interest, but they are frequently incorrect and not willing to admit it. Expecting an infinite number of irrational relationships leaves the data unconstrained and matches the world around us.

Of course, with so many of our technologies based around the assumption of static data, it's not hard to see why there are so many problems with our implementations. Building flexible generalized handling on an inflexible static technology is complex, but it is doable.

There are more things I can say about data, particularly if we want to analyze its underlying structure as it becomes information, but that's best dealt with in a future post.

If you understand the data, the code is trivial. If you get the data right, the system is fairly stable. You can build complex things and keep them running for long periods of time. There are still lots of problems, and the limits of the technologies can be frustrating, but at least the complexity can be managed if you can accept where it is coming from and why it is created there.