Monday, October 22, 2007

First Principles and Beyond

My brain uses an efficient algorithm for garbage collection. Sometimes, it can be a little too efficient. I'll be in the middle of using some fact and then "poof" it is gone.

"Gee, thanx brain! Now I'll have to ask for that phone number again, ... for the third time". Wetware can be such a pain.

Besides the obvious disadvantages of having such quick fact cleanup, there are a few bright sides to limited short-term retention. One important side effect is that I am constantly forced to go back to first principles, mostly because I've forgotten everything else. Re-examining the same ground over and over breeds familiarity. You not only come to understand it, but you also come to simplify that understanding, which is one of the harder things to do well. Because I can remember so little, everything must be packed as tightly as possible or else it is forgotten.

That perspective is interesting if I use it to examine something such as software development. We always build up our knowledge over time, sometimes to the degree where we make things more complex than necessary. Sorting back through our understanding is an important exercise.

Building software can be a very complicated pursuit, but underneath, creating software is actually fairly simple. We are doing nothing more than building tools to manipulate data. In spite of all of the fancy features and appearances, when you get right down to it, the only things a computer can actually do are view, add, delete and edit data.
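As a rough sketch of that claim, here is a tiny in-memory tool whose entire surface is those four operations. The `ContactBook` class and its method names are hypothetical, invented purely for illustration.

```python
class ContactBook:
    """A hypothetical tool that only views, adds, edits and deletes data."""

    def __init__(self):
        self._numbers = {}  # maps a name to a phone number

    def add(self, name, number):
        if name in self._numbers:
            raise KeyError(f"{name} already exists")
        self._numbers[name] = number

    def view(self, name):
        return self._numbers[name]

    def edit(self, name, number):
        if name not in self._numbers:
            raise KeyError(f"{name} does not exist")
        self._numbers[name] = number

    def delete(self, name):
        del self._numbers[name]


book = ContactBook()
book.add("Alice", "555-0100")
book.edit("Alice", "555-0199")
```

Any fancier feature layered on top, such as search or export, would still be some combination of these four operations on the same data.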

Playing music on a portable mp3 player is a good example. The computer copies the data from its internal storage over to some circuitry that generates the currents for the sounds. We hear the currents vibrating through the speakers, but the computer's actual responsibility stopped just after it copied over the information. When it is no longer just a collection of binary bits, it is no longer a computer that is doing the work.

We can go deeper. When you get right down to it, all software can be seen as nothing more than just algorithms and data working together within some framework. At the higher level we perceive this as software applications with various functionality, but that is an illusion. Software is deceptively simple.

We can take advantage of this simplicity when we build systems. Orienting our understanding of what we build around the data it manipulates simplifies the overall design for a system. The data is the key. There can be a great deal of functionality, but it is always restricted by the underlying range of available data. You have to get the data into the system first, before you can start manipulating it.

One key problem with badly written software code is the amount of repeated data in the system. Big balls of mud and poor architectures are identified by the redundant copying and translating of the same data around the system in a huge variety of ways. This not only wastes CPU and memory, but it also makes understanding and enhancing the system difficult. A sense of complexity can be measured from the volume of redundant code within a system. If, with enough time, you could delete half the code yet keep the same functionality, you have a system in need of some serious help.
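One way to picture that redundancy, as a hypothetical sketch: two functions that each re-translate the same raw record, next to a version where the translation happens once and every feature reads the result. All of the names and the sample record are made up for illustration.

```python
RAW = {"first": "ada", "last": "lovelace", "phone": "5550100"}


# Redundant version: each function repeats the same translation of the raw data.
def label_redundant(raw):
    name = raw["first"].title() + " " + raw["last"].title()
    return f"To: {name}"


def directory_line_redundant(raw):
    name = raw["first"].title() + " " + raw["last"].title()
    phone = raw["phone"][:3] + "-" + raw["phone"][3:]
    return f"{name}: {phone}"


# Consolidated version: translate once, then every feature reads the same data.
def normalize(raw):
    return {
        "name": raw["first"].title() + " " + raw["last"].title(),
        "phone": raw["phone"][:3] + "-" + raw["phone"][3:],
    }


def label(person):
    return f"To: {person['name']}"


def directory_line(person):
    return f"{person['name']}: {person['phone']}"


person = normalize(RAW)
```

Both versions produce the same output, but in the second one a change to how names are formatted happens in exactly one place.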

Understanding the importance of data not only affects the code, but it translates all of the way up to the architecture itself. The blueprints that define any system need only consist of outlining the data, algorithms and interfaces within the system. Nothing more.

In most cases, the majority of the algorithms in a system are well defined and come from textbooks. Probably less than 20% require anything other than just a brief reference, such as "that list is 'sorted'".

Interfaces are frequently the worst handled part of modern software development, generally because only the key ones are documented. The rest are just slapped together.

Anywhere that people interact with the computer should be considered an interface, which not only includes GUIs and command-lines, but also configuration files, libraries and your database schema. Rampant inconsistency is a huge problem these days. Consistency at all levels for an interface reduces its size by at least an order of magnitude. Most interfaces would be just as expressive, at less than a quarter of their size.

For large projects the most natural lines for subdividing the work are the natural boundaries of the data. Those boundary lines within the system -- either horizontal (components) or vertical (layers) -- themselves should be considered interfaces. Specifying them as part of the architecture ensures that the overall coherency in the system remains intact.

Within parts of the system there may also be boundary lines to delineate different functionally independent components. This helps for iterative development as well as supporting team efforts. If the lines are clear, it is easy to pick a specific set of functionality, such as a resource manager, and replace it with another more advanced one. Growing the system over a series of iterations reduces a large amount of development risk at the cost of a relatively small amount of extra work. When compared to going completely off course and having to start over, iterative development is extremely cost effective.
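To make the resource manager example concrete, here is a minimal sketch, assuming the boundary is just two methods. The `ResourceManager` protocol, both implementations and `run_job` are hypothetical names chosen for illustration, not a reference to any particular system.

```python
from typing import Protocol


class ResourceManager(Protocol):
    """The boundary line: callers depend only on these two methods."""

    def acquire(self) -> str: ...

    def release(self, resource: str) -> None: ...


class SimpleManager:
    """First iteration: hands out a fresh resource every time."""

    def __init__(self) -> None:
        self._count = 0

    def acquire(self) -> str:
        self._count += 1
        return f"resource-{self._count}"

    def release(self, resource: str) -> None:
        pass  # nothing is reused in this version


class PoolingManager:
    """Later iteration: reuses released resources before creating new ones."""

    def __init__(self) -> None:
        self._count = 0
        self._free = []  # released resources waiting to be reused

    def acquire(self) -> str:
        if self._free:
            return self._free.pop()
        self._count += 1
        return f"resource-{self._count}"

    def release(self, resource: str) -> None:
        self._free.append(resource)


def run_job(manager: ResourceManager) -> str:
    """Code on the other side of the boundary; it never learns which manager it got."""
    first = manager.acquire()
    manager.release(first)
    return manager.acquire()
```

Because `run_job` only touches the two methods on the boundary, swapping `SimpleManager` for `PoolingManager` in a later iteration requires no change to the calling code.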

Orienting the development around the data not only makes the code cleaner and more advanced; it also provides the lines to break down and encapsulate the components. But, even more critically, it makes analyzing the underlying problem domain for the tool much simpler as well.

When you understand that the users need a tool to manipulate only specific subsets of data in very specific ways, that drives the way you analyze the problem. You don't get caught worrying about issues that are outside the boundary of the available data. They are outside of the software tool's scope.

In looking at the users, you see the limited amount of data they have available. They want to do things with their data, but only a limited number of those things make sense and are practical. Technology problems become so much simpler when seen this way. If you can't acquire the data to support the functionality, you could never implement that code. You don't need to get lost in a sea of things that are impossible.

Getting back to data. The great fear with programmers specifying their systems before they write them is that they feel this will make their jobs trivial. Some think that they can stumble around in the dark, and eventually get to the right system. Others just want to build fast and light in the hopes that over time they find the right tools.

Truthfully, you cannot solve a complex problem without first thinking deeply about it. The fruits of your thinking are the design of the system. If you don't want to waste your time, you need to have worked through all of the critical issues first. If you don't want to waste your thinking, you need to get that information down into a design.

An endlessly large list of requirements and functionality isn't a coherent design.

A decent architecture of data, algorithms and interfaces shouldn't be a massive document. In fact, for simple systems, it might even be just a few pages. You can refer to standards and conventions to save time in re-specifying facts which are well known. Nobody reads the obvious, so why waste time in writing it?

Such a design, if it breaks the system down into clear parts with well-known lines, doesn't remove the joy of programming. Instead it instills confidence in the initial labors, and confidence that the project will be able to succeed with few surprises.

Instead of making programming a drudgery, it actually makes it more fun and less painful. I've learned this the hard way many a time, but thanks to my efficient garbage collection, I've also forgotten it a few times already.

The simplicity underlying software is the key to being able to build increasingly complex systems. Too often we get lost in the little details without being able to see the overall structure. Most developers still orient their understanding of software around its implemented functionality, but this just illuminates the inner workings in its most complex fashion. The data for most systems is simple and it absolutely bounds the code, so orienting your perspective on it simplifies your issues and broadens your overall understanding. We really don't want to make our lives more complicated than they need to be; that threatens our ability to be successful. I've learned that with my limited memory the only real problem I can afford to face in a huge system is the complexity of the system itself. To keep it manageable I need to continually simplify it. To keep it simple I often need to go back to first principles and find the right perspective. It is amazing how much of the complexity can just fall away with the right vantage point.