Saturday, May 31, 2008

The Essence of Software

A few posts ago, I did some wishful thinking about the possible value of having blueprints for software projects:

One of my readers, John Siegrist, became enchanted by the idea and suggested that we start a project to look further into it. To this end, we've put together a wiki at Wetpaint for anyone who is interested to pop by and give us a hand:

One NOTE of CAUTION: many discussions in software end up in polarized arguments based around intractable positions defending one form of technology over another:

I'd like to avoid that for this new site. If you come to visit or to contribute to this project, please come with an "open mind".

An exec once told me: "the dirty little secret of the software industry is that NONE of this stuff works!" That's closer to the truth than I think most people realize. In the end, it all has its share of problems, and we aren't even a fraction of the way there yet, so why dig in and call it finished? If it is to be a real quest for answers, we first have to let go of our prejudices; only then can we objectively see the world.


To get our work going, I figured I would fall way back and try decomposing the essence of software.

We all know, or think we know, what it is, but one of the great aspects of our world is how it allows us to look again and again at the same thing, but from a slightly different perspective. Each time, from each view, we learn a little more about what we see. The details are infinite, and oftentimes even the smallest nuances are the cause of major misunderstandings.

My first observation is that the only thing that "software" does is to help us build up "piles" of data.

Hardware may interact with the real world, but software only exists in its own virtual domain. The only thing in that domain that is even remotely tangible is data. Oddly, people like to think that software doesn't physically exist, but in fact it does have a real physical "presence". At minimum it's on a hard disk, if not stored actively in the RAM of at least one computer. It is physical -- magnetically polarized bits of metal and currents flowing through silicon -- it is just very hard to observe.

The "pile" perspective is important because it gets back to what I think is at the heart of all software: the data.

Most programmers tend to think of software as a set of instructions to carry out some work, but that subtly misleading perspective lures one away from seeing how much more important the data is than the instructions that operate on it.

That is a key point which always helps to simplify one's understanding of how to build complex systems. As masses of "instructions", these systems are extremely complicated, but as "transitions" on various structures of data, they are in fact quite simple. You can convolute the instruction set all you want, but you cannot do that with the data structure snapshots; they are what they are.

That leads nicely into another major observation. All software packages are nothing more than just a collection of functions that help manage a pile of data.

It doesn't matter if the software consists only of command line utilities, a simple command shell, or a fancy GUI. All of these are just ways of tying some "functions", their "arguments" and a "context" together to display or manipulate some data. It doesn't matter if an actual "user" triggers the function or just another piece of code, it is all the same underneath.

The functionality "gets" some clump of data, applies some type of "manipulation" to it, and then "saves" it.

If it's printing something to the screen, the 'get' probably comes from some data source, the 'manipulation' is to make it pretty, and the 'save' is null. If it is adding new data, the 'get' is null, the 'manipulation' is to fill it out, and the 'save' writes it back to a data source. Updates and deletes are similar.
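As a rough sketch of that get / manipulate / save cycle, assuming a plain dictionary as a stand-in for the data source: the names `run_function` and `store` are made up for illustration, and a "null" step is simply passed as `None`.

```python
from typing import Any, Callable, Dict, Optional

Store = Dict[str, Any]  # hypothetical stand-in for a file, model, or database


def run_function(store: Store,
                 get: Optional[Callable[[Store], Any]],
                 manipulate: Callable[[Any], Any],
                 save: Optional[Callable[[Store, Any], None]]) -> Any:
    """Run one get -> manipulate -> save cycle; a null get or save is None."""
    data = get(store) if get is not None else None
    result = manipulate(data)
    if save is not None:
        save(store, result)
    return result


store: Store = {"users": ["ann", "bob"]}

# Display: get from the source, make it pretty, save nothing.
shown = run_function(
    store,
    get=lambda s: s["users"],
    manipulate=lambda users: ", ".join(u.title() for u in users),
    save=None,
)

# Add: nothing to get, fill out a new record, write it back.
run_function(
    store,
    get=None,
    manipulate=lambda _: "cy",
    save=lambda s, user: s["users"].append(user),
)
```

Both calls go through the same three steps; only which steps are null differs.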

There is nothing more to a computer program; the rest of it is how this is repeated over and over again within the code.

The source of the data can change. It can be an internal model, a library of some type, a file, or even a direct connection to an external data source like an RDBMS. All of these things are just different representations of the data as they are stored in different mediums. Although the data may be similar, the mediums often impose their own bias on how the data is expressed.


If you freeze time, you'll find that at any given moment all of the data in a computer system is just a set of static data structures.

This includes of course the classic structures like linked lists, hash tables, trees, etc. But it also includes simple primitive types, any explicit language structures, and any paradigm-based structures such as objects.

Pretty much any variable in the system is the start of, or contained in, some internal data structure. Some of these structures can be quite complex, as localized references touch into larger global structures, but at our fixed point in time all of the variables in the system belong to at least one (possibly trivial) data structure.

That is important because we want to decompose all of the "code" in the system as simply being a mapping from one data structure to another. For any two static data structures A and B, a function F in the system simply maps A -> B. In its simplest form, a computer is just a massive swirl of different constantly morphing structures that are shifting in and out of various formats over time. And hey, you thought you were shooting aliens, didn't you?
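To make the A -> B mapping concrete, here is a small sketch under made-up assumptions: structure A is a list of score events, structure B is a table of totals per player, and `f` is the function between them. None of these names come from a real game.

```python
from typing import Dict, List, Tuple

Scores = List[Tuple[str, int]]  # structure A: (player, points) events
Totals = Dict[str, int]         # structure B: points summed per player


def f(a: Scores) -> Totals:
    """Map a snapshot of structure A onto a new structure B."""
    b: Totals = {}
    for player, points in a:
        b[player] = b.get(player, 0) + points
    return b


a: Scores = [("ann", 10), ("bob", 5), ("ann", 7)]
b = f(a)  # {"ann": 17, "bob": 5}
```

The snapshot `a` is left untouched. All the code did was produce another static structure from it.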

Because data is the key, we need to look deeper into it.

Data is always "about" something. It is an instance of something, be that a noun or a verb. If it is a noun, then more than likely the data is about a three dimensional object in our world. A placeholder for something we know about.

If it is a verb, then it is more than likely related to "time" in some way. The nouns are the objects in our world, while the verbs are the actions.

That makes the verbs interesting, because their definition allows data to exist about some functional transition within the software. For example, we can store the list of last-run commands from an interactive shell; these become verbs that relate back to the most recent actions of the shell.

That "relationship" is also interesting, but it should not be confused with running code; frozen in time, the data stored -- verb or noun -- is still fixed static data that is set in a structure. The "meaning" of the data may be relative (and very close) to the functioning of the code, but it does not alter what it is underneath, which is static data in a structure. You keep a description of the commands, not the actual commands.

Going backwards and selecting an old command to run is simply a way to fill in the "context" for the run functionality with something that occurred earlier. The computer is not, in any way, shape, or form, reconnecting with its earlier results.
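A toy shell shows the distinction. This is only a sketch: the `COMMANDS` table, `run`, and `rerun` are hypothetical names, and the history holds nothing but descriptions (a name and its arguments) of what was run.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical commands for a toy shell.
COMMANDS: Dict[str, Callable[[List[str]], str]] = {
    "echo": lambda args: " ".join(args),
    "count": lambda args: str(len(args)),
}

# The history is static data: descriptions of commands, not the commands.
history: List[Tuple[str, Tuple[str, ...]]] = []


def run(name: str, args: List[str]) -> str:
    """Record a description of the command, then execute it."""
    history.append((name, tuple(args)))
    return COMMANDS[name](list(args))


def rerun(index: int) -> str:
    """Use an old history entry to fill in the context for a fresh run."""
    name, args = history[index]
    return run(name, list(args))


first = run("echo", ["hi", "there"])
again = rerun(0)
```

The rerun produces the same output only because the same context is fed through the code a second time. Nothing reaches back to the earlier result.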


At times there can seem to be an almost magical tie between pieces of data. However, computers are simple deterministic machines, so there is absolutely nothing special about the way some data is linked to other data.

The only thing unexpected is how, under the covers, the computer may be holding a lot more "contextual" information than most people realize.

To make this massive sea of data structures more functional, users often need to bind one set of data to another. There are pieces of data that, when understood, "link" one piece of data to another. These links come in two basic flavors: explicit and implicit.

The easiest one to understand is an explicit link. It is a connection between two data structures. It can be direct, such as a "reference" or a "pointer", or it can be indirect, bouncing a number of times between intermediate values that bind the two structures together in a given direction. As there is no limit to the number of "hops" in an indirect link, in many systems these can be very complex relationships.

Explicit links are always one way, so two are needed if you intend on traveling from either structure to the other one. For example, "doubly" linked lists allow you to traverse the list equally from either direction, while singly linked lists can only be traversed in one direction.

Explicit links form the bread and butter of classical data structures. All of the different instances, such as lists or trees, are just different ways of arranging different links to allow for various behaviors during a traversal.
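The one-way nature of explicit links is easy to see in a minimal linked-list sketch. The `Node`, `build`, `forward`, and `backward` names are illustrative only; a singly linked list simply leaves every `prev` link unset.

```python
from typing import List, Optional, Tuple


class Node:
    """A list node holding up to two one-way explicit links."""

    def __init__(self, value: str) -> None:
        self.value = value
        self.next: Optional["Node"] = None  # link toward the tail
        self.prev: Optional["Node"] = None  # link toward the head (doubly only)


def build(values: List[str], doubly: bool) -> Tuple[Optional[Node], Optional[Node]]:
    """Build a list and return (head, tail); prev links only if doubly."""
    head: Optional[Node] = None
    tail: Optional[Node] = None
    for value in values:
        node = Node(value)
        if tail is None:
            head = node
        else:
            tail.next = node
            if doubly:
                node.prev = tail
        tail = node
    return head, tail


def forward(head: Optional[Node]) -> List[str]:
    """Follow next links from the head."""
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out


def backward(tail: Optional[Node]) -> List[str]:
    """Follow prev links from the tail."""
    out = []
    while tail is not None:
        out.append(tail.value)
        tail = tail.prev
    return out


d_head, d_tail = build(["a", "b", "c"], doubly=True)
s_head, s_tail = build(["a", "b", "c"], doubly=False)
```

Walking `backward(d_tail)` reaches every node, while `backward(s_tail)` stops after the tail because the second set of links was never put in place.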


A well-maintained set of explicit links between a series of data structures forms itself into one great big giant structure. All the transformations need to do is traverse the structure based on rules and then update part of it or build a new structure. It is a very controlled deterministic behavior. If that were all there was to software, it would be easy, but unfortunately it is not so simple.

An "implicit" link is one that "can be made" between two data structures, allowing some piece of code to traverse from one to the other. The "can be made" part of this sentence hides the real complexity, but first I'll have to detour a little bit.

An algorithm is a deterministic (possibly finite) sequence of steps that accomplishes some goal. Well, better than that: it works 100% of the time. A heuristic, on the other hand, in Computer Science is a set of sequences that "mostly" works. I.e. it works "less than 100%" of the time. It is hugely critical to see the difference between these two: algorithms ALWAYS work, heuristics MOSTLY work.

It is a common misconception to think a particular heuristic is in fact an algorithm, IT IS NOT, and they always need to be treated differently.

If you stick an algorithm in your system, and you debug it, you know that it will always do the right thing. With a heuristic, even after some massive amount of testing, production, etc., there is a chance, even if it is minute, that it could still fail. A chance that you must always be aware of, and account for in your development. There is a huge difference between knowing it is going to work, and thinking that it is working. Even if it's a 0.00000001% chance of failure, it opens the door as a possibility. Failure doesn't automatically mean that there is a bug.
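One classic illustration, which is my choice of example and not something specific to this discussion, is making change with the fewest coins. Grabbing the largest coin first is a heuristic: it gets the minimum for many coin sets, but not all. The dynamic-programming version below is an algorithm: it always finds the minimum. Both sketches assume a 1-unit coin exists, so every amount can be paid.

```python
from typing import List


def greedy_coins(amount: int, coins: List[int]) -> int:
    """Heuristic: always take the largest coin that still fits."""
    count = 0
    for coin in sorted(coins, reverse=True):
        count += amount // coin
        amount %= coin
    return count


def exact_coins(amount: int, coins: List[int]) -> int:
    """Algorithm: dynamic programming over every total up to amount."""
    best = [0] + [float("inf")] * amount
    for total in range(1, amount + 1):
        for coin in coins:
            if coin <= total and best[total - coin] + 1 < best[total]:
                best[total] = best[total - coin] + 1
    return best[amount]


# With familiar coins the heuristic happens to agree with the algorithm.
usual = (greedy_coins(30, [1, 5, 10, 25]), exact_coins(30, [1, 5, 10, 25]))

# With coins 1, 3, 4 and amount 6 it does not: 4+1+1 versus 3+3.
odd = (greedy_coins(6, [1, 3, 4]), exact_coins(6, [1, 3, 4]))
```

No amount of testing with the familiar coin set would have revealed the failure. The greedy code has no bug; it is simply not an algorithm for this problem.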

For an implicit link, there is some piece of code, somewhere, that can bind together the two ends of the link. Usually in both directions, but not always. Where an explicit link traversal is always an algorithm, an implicit link one may not be. It depends on the code, and it depends on the data. Thus it gets very complicated very quickly, especially if you're not careful.

All implicit links are code-based. Common ones include full-text searching, grep and even some SQL queries. They show up anywhere the 'connection' is dependent on some piece of running code.

So for example, if you have a user Id in your system, going to the database to get the 'row' for that Id is explicit. Going to the database to match a user-name like '%mer%' is implicit. The Id in the system is data that is bound to that specific row, and if it exists, it will absolutely be returned. In the implicit case the wild-cards may or may not match the expected 'set' of entries. You may be expecting only one, and multiple ones could be found. There are lots of different possibilities and the results are not always consistent.
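The two lookups can be tried side by side with Python's built-in sqlite3 module. The `users` table and its three rows are invented for this sketch.

```python
import sqlite3

# An in-memory database with a made-up users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(1, "merlin"), (2, "summer"), (3, "bob")],
)

# Explicit link: the Id is bound to exactly one row.
by_id = conn.execute(
    "SELECT name FROM users WHERE id = ?", (1,)
).fetchall()

# Implicit link: whatever happens to match the wild-cards comes back.
by_pattern = conn.execute(
    "SELECT name FROM users WHERE name LIKE ? ORDER BY id", ("%mer%",)
).fetchall()
```

Someone looking for "merlin" with '%mer%' also gets "summer" back, and the result set will shift again as soon as another matching name is added.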

There is some variability with implicit links based on text and wild-card searches in a relational database. There is a huge amount of it when using a full-text Internet search tool like Google. In the latter case, the results can change on a frequent basis, so that even two concurrent searches can return different results.


Explicit links help maintain the deterministic quality that is inherent in a computer. The problem is that they are fragile, and if the data is linked to a huge amount of related data, they can be horrible to keep updated.

Not only that, but they require forethought on the part of the computer programmers to put them in place when the data is collected. Because they are just guessing, software developers rarely know how the data will get used years later, and it is generally a surprise. Forgetting to collect data relationships is a common problem.

On the other hand, implicit data is imprecise, messy and can also require massive amounts of CPU. Not only that, but connecting any two non-obvious pieces of data heuristically is an "intelligence" problem, requiring human intervention.

In most modern systems the quality of the data ranges from poor to extremely crappy, which becomes a significant impediment toward implicit links. At least explicitly, the quality issues are dealt with at the time of constructing the initial data. Low-quality data can provide an astoundingly strong defence against making implicit links, where under some threshold, the linking might as well be random.


So why is this important in terms of software blueprints? One of the key points I think is important for a blueprint is to be able to specify the behavior of the system in a way that isn't technology dependent. Even more important, it cannot 'ape' the underlying programming paradigm.

Going way back, we were taught to write out pseudo-code before we settled into writing actual code. It was an early form of trying to find a more relaxed way of describing the behavior without having to go into the rigorous details.

The problem with that is that pseudo code is only a mild generalization of the common procedural programming languages of the day. I.e. it was just sloppy C code, and if you were developing in C it was only a half-step away from just writing it.

That means that the effort to pseudo code an algorithm wasn't actually all that far away from the effort of just coding it, so why not just code it?

In some circumstances, alternative perspectives, such as ER diagrams and SQL schemas are useful enough that you might choose to switch back and forth between them. The textual view vs. the graphical view is useful in distinguishing problems, particularly consistency ones.

With C and pseudo code, it is a one way mapping: C -> pseudo code, and the effort is close enough, and the perspective is close enough. There is little value in that.


Construction blueprints are simple 2D representations of 3D objects that contain just enough information that the building can be built reliably, but not so much that they take as long to produce as the building itself. They are a tiny (in comparison) simplified representation of the core details, enough to allow someone to visualize the final results, so that it can be approved, and it isn't a shock when completed. One person can lay out the essence of the work later done by hundreds.

If a computer system is just a collection of functions that operate on a specific pile of data, then we are looking for a similar higher-level simplified view that can provide just enough information to know the design is good, without requiring enough effort to take years to complete.

I see this, abstractly, as being able to get a hold of the main 'chunks' of data in the pile, and listing out only some of the major functionality about them. In a sense, I don't care if the user's name is stored as full-name, first-name/last-name or first-name/middle-name/last-name; these are just 'stylistic' conventions in the schema if they don't have a significant impact on the overall functionality of the system.

If the system holds address information, the specifics of the decomposition are equally unimportant for the blueprints. One big field or a variety of small ones, often the difference is trivial or nearly trivial.

At times, the decomposition in one system may help or impede implicit links with other systems. In these cases, then for that interoperability the specific 'attribute' may be important, but only if one thinks about it ahead of time, or it is some corporate or common standard that needs to be followed.

Not only are the smaller sub-attributes of the specific data not important for the blueprints, but also a significant amount of the overall functionality is unimportant as well.

Who's not seen a specification for a system that goes into over-the-top extreme detail about every little administration function, trivial or not? Most often it was a waste of someone's time. Frequently it doesn't even get implemented in the same way it was specified.

In most cases, if you have the system handle a specific type of data, such as users, for instance, there are 'obvious' functions that need to be included. If you store user information, you need to be able to add new users, delete them, and change their information. If you miss any of this functionality initially, you will be forced at some time to add it back in. The only 'real' issue is "who" has access to these functions.

In fact, it is pretty safe to say that for each and every type of data that you allow to be manipulated, in one form or another, you'll need to put in add, delete and modify capabilities. To not do so, means the system will not be 'well-rounded'; it will be only a partial implementation.
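A 'well-rounded' lump of data can be sketched generically. The `Lump` class below is a hypothetical illustration, not a recommended design: whatever type of data it holds, it comes with add, modify, and delete from the start.

```python
from typing import Dict, Generic, TypeVar

T = TypeVar("T")


class Lump(Generic[T]):
    """One 'lump' of data plus the obvious collection of functions on it."""

    def __init__(self) -> None:
        self._items: Dict[int, T] = {}
        self._next_id = 1

    def add(self, item: T) -> int:
        """Store a new item and return the id assigned to it."""
        item_id = self._next_id
        self._items[item_id] = item
        self._next_id += 1
        return item_id

    def get(self, item_id: int) -> T:
        """Return the stored item; raises KeyError if the id is unknown."""
        return self._items[item_id]

    def modify(self, item_id: int, item: T) -> None:
        """Replace an existing item; raises KeyError if the id is unknown."""
        if item_id not in self._items:
            raise KeyError(item_id)
        self._items[item_id] = item

    def delete(self, item_id: int) -> None:
        """Remove an item; raises KeyError if the id is unknown."""
        del self._items[item_id]


users: Lump[str] = Lump()
uid = users.add("ann")
users.modify(uid, "ann smith")
```

Leaving out `modify` or `delete` would not make the need for them go away. It would only defer the work until someone asks for it.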

While that's a trivial case, the underlying general case is that if you walk a specific 'path' with some set of functionality, you will have to walk the entire path at some point. I.e. if you quickly toss in something to add some new type of data, later the delete and mods will become highly-critical changes that need to be implemented.

If you provide some type of linkage to another system, then a whole host of data synchronization, administration and navigation issues will all show up really fast. More importantly, once you take even one baby step on that path, everyone will be screaming about how you have to take more, and more. Users never let you get away with just a partial implementation, no matter how clever you are in thinking that you can simplify it.

If you open the door and take a couple of steps, you're committed to have to round out the implementation.

From a higher viewpoint, that means that data comes in 'lumps' and functionality comes in 'collections'. So much so, that we could go a long way in a software program with just a simple statement:

"The system will require user accounts and allow them to be managed."

That specifies a lump of data, and a collection of functionality to manage that data.

If the system is a Web 2.0 social site, the "unique id" for the users may be an email account, if it is more of an older 'main-frame' style, it is probably a centrally administered mangling of the user's name.

The specification gives a general idea for the software; "convention" sets the specific data and functions. The blueprints are some way of having all relevant parties understand that the software requires the users to log in with some type of account information. What that really means is up to the underlying technology and the driving conventions. The blueprints are higher than both.


There is a significant benefit in being able to decompose software into its essence. If we look beyond technologies, techniques and paradigms into the heart of what we are doing, we can find the types of common patterns that allow us to frame our work in a more meaningful way. And, as in the case of our search for blueprints, we can use that higher-level perspective to help us decide what insignificant 'details' we can drop from the blueprints that won't "radically" change the implementation.

Blueprints help going both forward and back.

In front they help us leverage an experienced architect's ability to design and implement large complex systems, and behind they allow architects traveling in the same footprints to take something that they know works and extend it. We can, and should, learn from our past, and apply that to our future. A reasonable thing that most other disciplines already do as part of their normal process.

Speculating, I also tend to think that if we find some way of creating good software blueprints, a side effect of the format will be that the perspective will become data-centric, not function-centric.

While data is not the intrinsic way to view systems, it is far, far easier and has been at the heart of movements like object-oriented programming. It is just that it never seems to really catch on, or it becomes perverted back to being functionally-oriented.

If blueprints help to teach programmers to "see" their systems by the data first, then they will inevitably help to simplify them. In an industry where companies routinely build tens of millions of lines of code to solve problems that only require a fraction of that size, we need significant help in learning how to create elegant solutions, not brute force ones.

The explicit complexity of a system should not significantly exceed the ability of a programmer's lifetime to fully understand it. That's clearly past some reasonable threshold. Ideological tools like abstraction should be applied to compact our knowledge into some form that is more manageable. Especially if our lives and finances depend on it.

We are at that fork in the road where we can decide (again) whether we want to continue down the brute force path, or whether we would like to shift to a higher, less painful route. Personally, I'm hoping to avoid more pain.