Sunday, March 30, 2008

Controlling Development Chaos

A lot of software developers really hate any comparison or analogy that involves building construction. I think the primary reason behind this dislike is the sense that construction workers are viewed as passive players in the whole development project, in the same way that factory workers are seen as nearly-mindless cogs in a production line.

There is some reason to fear this type of comparison; intellectual pursuits are clearly not the same as physical ones. The rules are different. For whatever reason, we have a great deal of knowledge on how to refine and maximize the efforts of factory workers, but virtually none on how to do the same for white-collar jobs. Trying to take what we've learned from the physical and apply it to the intellectual hasn't been particularly effective.

Still, in this post, I start by looking back at a construction analogy and then move forward to a more sophisticated, and probably more popular, analogy. Even if construction is not a perfect fit, building things is always a lot like building things, and we can always learn a lot from the way other people build things.


THE UNLOVED CONSTRUCTION ANALOGY

While I agree that we shouldn't draw too many parallels between software and construction, there are still things about a construction analogy that are 'truish' enough to be useful.

For instance, a large sky-scraper project contains a huge number of specialists, including carpenters, plumbers, electricians and of course general-labourers. Alongside them are engineers, managers, inspectors and a whole host of other people with interesting roles in keeping the process, and the building, from collapsing. Software projects are similar, in that they often need graphic designers, editors, GUI experts, database programmers, system administrators, managers, tool-smiths and all manner of specialists for handling any sophisticated technologies or domain issues. They are multi-disciplinary projects.

Both software development and construction involve building things and both, to some degree or another, change radically depending on the size of the thing you are building. A shed out back is not the same as a house, or an apartment building. A sky-scraper is a completely different deal. Scale has a huge influence on the size of the effort and the techniques required to bring it all together. One does not easily jump from building sheds in a backyard to building apartment buildings; it is a different set of skills.

The construction of a modern sky-scraper is an amazing project that is pulled together with a degree of smoothness that software developers can only drool at. If we could build things as large and sophisticated, with the same degree of precision and timing, then we would be able to get beyond our current hit-or-miss random guessing style of development process.

It is worth mentioning too, that while design and code are fixed to some degree, the job of an electrician on a huge site still requires a significant amount of thinking and problem solving. It is not a clear-cut mindless day at the office. While they may be working with their hands, electricians still need to think about what they are doing, the various plans and codes, and how they are going to make it work in the current context. The same is true for all of the other specialists. It takes more mental effort than just being a body, that after all is the specific job assigned to general labourers.


WHY IT FAILS

But still, even if you give all of that its due, construction just doesn't map entirely onto software development. We know this, it is a frequent discussion. Sometimes, I wonder if it really is a case of us just not wanting to accept it; we have a great deal of freedom, and it becomes hard to surrender that, even if it is for the good of the project. Still, there are strong differences.

The biggest single problem with construction is that it is a one-phase deal. You build the building, and then move in and repair it as you go. When we tried this with our waterfall ideas, we found it doesn't match reality. Code, it seems, never gets done and is always rusting. Projects just don't end, and when they do, the code goes out of circulation. For software, there are a nearly infinite number of phases, ending only with the lifetime of the code.

The state of the art of building design has progressed over centuries, and mostly it is only slightly tweaked between buildings. In a sense, construction companies redo the same projects over and over again, they have no choice. This repetitiveness is the root of mastering the process. If you do it enough, eventually you'll get good at it. Contrast that with software, where each design for each system is unique and rarely learns from its predecessors. Because we can clone code, we don't want to keep reinventing it (even where in some cases, that might actually produce significantly cheaper or better code). Code builds up, but our experiences do not.

Buildings stay around for a long time. That aspect, as well as safety considerations, bends the choices made towards the less risky long-term options. Saving time or money by not following the process or the code, or by using inferior parts, may help in the short run, but because the life span is so long it opens up a lot of risk of getting caught in the future. Most buildings are built correctly. Software, on the other hand, being mostly invisible and uncontrolled, doesn't provide much incentive for programmers to make long-term choices, despite the fact that software projects are always long-term projects.

Another significant difference is that workers on a construction site have fewer degrees of freedom than computer programmers. Their roles and jobs are far more rigid. Right or wrong I don't know, but this is a key issue that I want to get back to later. We've always had our freedoms, and they have always been a problem.

For all of the differences, there is also a lot that matches, but not enough that we should try to emulate construction companies, although we should admire their skill and organization. But it is exactly that line of thinking that sent me in search of some other type of better suited analogy.


GOING BEYOND CONSTRUCTION

As often happens with me, I get a little down the road of a thought, and then it gets left behind for a while. In particular, I dropped my pondering of construction analogies so that I could do my usual Friday night routine, which was watching a film with friends. The film -- which I can't remember -- had finished, and we had moved on to the DVD special features.

As the director and actors mutually patted each other on the backs and proclaimed their love of working together, a little thought was brewing in my head. A film, you see, is a large complex project that also brings a large number of multi-disciplinary professions together in order to create something. Most people don't quite see it as the same thing, but if you look at the effort and money that was poured into a mega production like Lord of the Rings, you start to see some similarity to these huge buildings we keep throwing up in our cities. But, it is hugely different.

We deem a film as art, and the director as an artist. We see most people involved with film as artisans, even though, in the end, there are also carpenters, electricians, plumbers, casting agents, cooks, boom operators, grips, computer programmers, and a huge host of other professionals involved.

What's interesting with a film is that even though the script is often written by one or more writers, and a huge number of producers and managers are involved, the films themselves have always held the particular 'stamp' of the director. In a very real sense, even with all of the 'creatives' plying their wares, the director of a modern major feature film gets and sets the final 'vision' under which the film will be created. You can tell a lot more about a film by its director than you can by its writers or actors. In many ways, the director of a film is far more significant than the architect of a building. An architect shapes the design, but the engineers ensure it is built correctly. A film director often has no such constraints, at least not ones that are 'that' objective (budget excluded). They control the vision.


ANOTHER ANALOGY

But if we go back to the special features, even with all of the director's influence, the starring actors still often talk about how much artistic freedom they were allowed in playing their parts and contributing to the film. In that way, a good director stays true to their vision, but not necessarily at the expense of making all the actors just mindless pawns on a film assembly line. Most films are collaborations of many artists, yet remain true to their director's goals.

So if I am looking for a more realistic role-model, I probably want to be more like a director of a modern film when I am leading a big development project. In that way, I want to have my vision implemented, and I want the final product to clearly bear my stamp, yet I want my fellow developers to be able to contribute to the overall project in a way that makes them feel honored and proud to have been on that particular development team. Like the actors, the coders should feel that they were able to make significant contributions, but without disrupting the overall product.

I do, at this point, want to warn programmers against thinking that I am saying they should have some infinite degree of freedom while developing software. Not even the directors have total freedom. There is a difference between creative contributions and just plain old chaos. The requirements of the users, the limits of the technology, and the consistency of the conventions all combine to remove a great deal of the degrees of freedom from coding.

For tools to be useful, they must have certain specific functionality that clearly matches a problem domain. You can't creatively work your way around the actual problems, they are what they are. Our existing technological base is awful, but it is the only one we have. If you have to bend your design for a rigid relational database schema in order to allow for reporting, then there is little choice in that matter. And if every other similar application uses widgets in some annoying brain-dead, but extremely common way, you too must follow that or pay the consequences of annoying your users instead of impressing them. A good software product, one that solves a specific problem is heavily constrained long before it is even sketched on the back of a napkin.

In all, if you are implementing a targeted solution for a specific industry, there isn't an infinite array of possibilities for the development. Like a 'genre' in film making, if you break the rules, you had better understand them first, and give a solid reason for breaking them, or people just take you for someone that doesn't get it.

A film like Lord of the Rings, for example, while intensely creative, is also highly restricted. Frodo can't be green, and you can't redefine elves back into funny little forest creatures. The film would flop if it violated the spirit of the books. In the same way, software is constrained by its intended targets.


FINAL CREDITS

In many ways software is unique, but in many ways it is just another method of building things. Our biggest difference is the immaturity which lingers in our development process. We clearly do not want to grow up and start producing better software. But the issue might have been one about what our newer 'maturer' self would really look like. With terms like architect and processes like waterfall, we've been quietly, but agonizingly, following along behind various construction analogies, but with little effect toward reducing the 'software crisis' first identified in the late sixties. Is it any wonder we are resistant to change? Is there actually anything to change to that will really work?

I think if we really don't want to be factory or construction workers, then we need to look carefully at other groups of skilled artisans, and start trying to fit their processes into ours. A director for a major motion picture, and his teams of 'creatives', is such a great parallel because it involves a large group of diverse people putting together a big complex work. Its main weakness is that a film is a short-term project; once completed it is over. Because of that, most choices of how to get the work done are not made for the long term. If we did that, it would be a disaster in software. Still, mixing this type of analogy with a construction based one, which is a much longer-term vision, produces something in the middle. We are neither, but we could be both.

I'm probably late with this analogy, as I suspect, given the credits on the back of most video games, that at least one segment of the software industry has been quietly following this analogy all along. Still, a unified single vision is the key to bringing out high-quality products, and the film industry does it better than any other. Major motion pictures, at least according to the DVD special features, are fine examples of creative and organizational projects. We should marvel at how they work, and how they manage to bring together so many diverse disciplines into one final amazing product.

Next time someone asks me about what title I want for leading a project, it may be something like: commercial software director. Can an Oscar be that far behind?

Friday, March 21, 2008

Testing for Battleships

In one of my earlier blog entries, I casually suggested that testing was like playing a game of 'Battleship'. It is an odd analogy, but as I pondered it a bit more I realized that there were interesting parallels between playing Battleship and testing, many that were worth exploring.

Often, it is exactly our change of perspective that allows us to see things with fresh eyes, even if it is tired and well-trodden ground.


AN INITIAL DEFINITION

For those of you not entirely familiar with it, Battleship was a popular kid's game, a long long time ago, in a galaxy far far away. Wikipedia has a great description of it:

http://en.wikipedia.org/wiki/Battleship_%28game%29

The version I grew up with was non-electronic and consisted of two plastic cases each containing a couple of grids, the 'ships' and a huge number of little red and white pegs.

In the first phase of the game, both players take their five 'ships' of varying sizes and place them on the horizontal grid. After that, one after the other, the players take turns calling out 'shots' by the grid coordinates. The coordinates are alphabetical on one axis, and numerical on the other. A shot to C5 for instance is the third position across and the fifth one down.

If the shot hits a ship, the defending player has to say 'it is a hit'. The calling player puts a red peg in the vertical grid to mark the spot. Otherwise, just so they don't call the same spot again, a white peg is used. The game continues until one player has sunk all of the other player's 'ships'. It is a race to see who sinks whose fleet first.

The big strategy of the game is to find ways to 'hide' the ships. For instance, lining them up against each other, so it is hard to see where one starts and one ends. Sometimes hiding them on the outskirts of the grid is great, sometimes not. Once the game gets going, it becomes a contest between the two players as to who figures out all of the other locations first. Mostly, the game ends long before all of the pegs have been entered into the board. From my rather random and fuzzy memory, I'd say usually around two thirds of the board was full at the finish. Sometimes more, sometimes less.


OBSERVATIONS

What does this have to do with testing? Well, interestingly enough, testing is kinda like playing half a game of Battleship against the code. You can see all of the functionality as a giant grid, a vast sea that contains a number of hidden bugs. The tester's job is to place tests (pegs) at various points in the grid to see if there is a bug there or not. Each test -- or part of the test -- acts as a little peg that reveals a possible bug. For a test suite, one after another, the testers keep adding pegs until all of the testing is complete.

Battleship ends when one player has sunk all of the other player's ships. Testing ends when the time is up or you've run out of little pegs. It is also a race, but usually against time or resources.

In the real game, the 'ships' come in multiple sizes that are well known. Bugs, on the other hand, come individually or in clumps and there is very little rationale behind either. In fact, there is a totally unknown number hidden in the grid that can be anywhere from zero to NxN bugs.

Now, most people might think testing occurs with a huge number of pegs, but if you really look at the coverage on a line-by-line, input-by-input basis, the grid is absolutely huge, while the pegs are few. Even if you are going to push the whole development team into an intense round of final system-wide super integration testing, it is still a pitifully small number of pegs versus the actual size of the grid.
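
To put rough numbers on that, here is a minimal sketch that treats the grid as every combination of a handful of inputs and asks what fraction a test suite could touch. The function name `peg_coverage` and the figures (four inputs, 100 values each, 5,000 tests) are made up for illustration, and the sketch generously assumes every test lands on a different spot.

```python
from math import prod

def peg_coverage(values_per_input, tests_run):
    """Fraction of the input grid touched if every test hits a distinct cell."""
    grid_cells = prod(values_per_input)  # total distinct input combinations
    return min(tests_run, grid_cells) / grid_cells

# Hypothetical example: four inputs with 100 possible values each,
# exercised by a suite of 5,000 distinct tests.
coverage = peg_coverage([100, 100, 100, 100], 5_000)
print(f"{coverage:.6%} of the grid has a peg in it")
```

Even with these small, invented numbers the pegs cover only a tiny sliver of the board, and a real system has far more than four inputs.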

One big difference between the game and real testing is that the grid grows with each release. The game is ever changing. Even when you think you have mastered it, it will still occasionally catch you by surprise. Although for the analogy I picture the functionality as an N by N grid, the real topology of any system, depending on representation, is some multi-dimensional arbitrary structure, where the sea-scape itself holds many clues into the increased presence or likelihood of bugs.

What I really like about this analogy is that it gets across the point that testing is hit-or-miss, and by definition not everything is tested. People tend to view it as a deterministic process; if you do enough testing, then you will catch all of the bugs. In fact, I've seen 'intensely' tested code, that has run in production for years and years, still yield up a bug or two. There are always a huge number of bugs and they are always buried deep. Unless a peg hits them directly with the right input, the presence of the bug is unknown.


TESTING STRATEGIES

Over the years I've seen many 'interesting' strategies for testing. Sometimes it is all manual, sometimes it is heavily automated. Some testing focuses on catching new problems, some of it is more biased towards regression. Generally, though, the most common thing about testing is that most of it is done on the easier bits. The hard stuff rarely gets touched, even though it is the most likely source of bugs. In that sense, developers quickly use up all their pegs clumped together in a close location on the grid, while ignoring the rest of the board. They tend to put the most pegs into the spots where the fewest bugs will exist.

Unlike the game, in testing the board itself offers tonnes of clues on how to test more effectively, but most developers prefer to disassociate themselves from that knowledge. As if, somehow, the testing will be more thorough if we don't get smart about what we test. That is a possibly workable strategy, but only if you have a huge mountain of pegs. If you have way too few pegs, its effect is probably the opposite.

Software developers hate testing, so it is usually handled very badly. Most big integration tests for example are just replications of unit-tests, done that way to save time. An oddly ineffectual use of time, if the tests both refer back to the same position on the grid. Underneath, some bugs may have moved around but we still need to test for them in their final position. Sticking the first peg in, then removing it, then adding it again is unnecessary effort.

Interestingly enough, for whole sections of the board, if there haven't been any changes to the code, then the same bugs remain hidden in the same positions. So executing the same peg pattern, or regression testing, will produce the exact same results as last time: nothing. No changes to the code + no changes to the tests = no new bugs, but also a huge waste of time.

If you get the sense of this game, then spending a great deal of time sticking the pegs into the boards only to not find any bugs seems as though it may have been a waste of time. Programmers like to test to 'confirm' the absence of bugs, but really for most code, particularly if it is new -- less than twenty years old -- not finding bugs more likely means the testing is ineffective, not that the code is good. It ain't. So that one last pass through the tests, with the expectation of not finding anything wrong, is really a big waste of time. You regression test to make sure none of the older bugs have reappeared, but that only makes sense if the code was actually affected. Retesting the same code multiple times, even when it hasn't changed, is a standard practice, but still crazy.

Clearly this also implies that any second round of testing should be based on any changes and on the results of the first round of testing. Parts of the board are not worth retesting, and it is important to know which parts of the board are which to avoid wasting resources.


PATTERNS

I can remember when I was a kid, I'd come up with 'patterns' to try and optimize my peg placement for Battleship. You know, like hitting every other spot, or doing a giant X or cross or something. I wanted a secret formula for always winning.

These never worked, and I would generally be pushed back to messing up the pattern because of the nature of the game. Once you've identified a red peg, you want to finish off the ship quickly. Then you get back to searching.

Overall the best strategy was to randomly subdivide as much as possible. Big gaps allow big things to hide. Random shots, distributed more or less evenly across the board tended to be more fruitful.
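
As a rough sketch of 'randomly subdivide', the code below cuts the board into equal blocks and drops one random peg in each, so the shots are random but no large gap is left untouched. The function name `subdivided_shots` and the block size are assumptions for illustration; this is one possible reading of the strategy, not the only one.

```python
import random

def subdivided_shots(size, block, rng=None):
    """Pick one random cell from each block-by-block region of a size-by-size grid."""
    if rng is None:
        rng = random.Random()
    shots = []
    for top in range(0, size, block):
        for left in range(0, size, block):
            # Clamp to the board edge in case size is not a multiple of block.
            row = rng.randrange(top, min(top + block, size))
            col = rng.randrange(left, min(left + block, size))
            shots.append((row, col))
    return shots

# A 10 by 10 board cut into 2 by 2 blocks gives 25 shots, one per block.
shots = subdivided_shots(size=10, block=2, rng=random.Random(0))
print(len(shots), "shots:", shots[:5], "...")
```

Shrinking the block size spends more pegs but leaves smaller gaps for things to hide in.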

One thing for certain about Battleship was that no matter who you were playing against, there was some underlying structure to the way they placed the ships. As the game progressed, the strategy for finding the next ships was based on the success of the current strategy. None of the fixed patterns worked because they were too static.

Although, no one would actually be able to explain why they placed the pieces the way they did -- to them it was random -- it just wasn't. Different opponents tended towards different strategies, which after a game or two became fairly obvious.

In a weird, but rather indirect way, the same is true for code. The bugs in the system are not really random; they come from an extremely complex pattern that will always seem random to human beings, but there is some ordered structure. The most effective ways to find them are by using instinct to work through the parts of the system that are most likely to have problems, or to ensure that there aren't problems where they might be really devastating.

On fixing a bug, retesting the entire system is a huge waste of resources. It isn't hard in most systems to narrow the board down to a smaller one that might have been affected by the change in code. Finding the true impact of a change may seem expensive, but it is considerably cheaper than retesting everything. In that regard, a well-crafted architecture, one that isn't just a big ball of mud, certainly helps in drawing lines around the possible effects of a software change. It makes understanding the impact of a change a lot easier.

Obviously, if you can reduce the size of the board, you increase the chances of finding any other hidden ships on the field, particularly if you didn't significantly decrease the number of pegs.
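
A little arithmetic backs this up, under a deliberately idealized assumption: each bug sits in a single cell and the pegs land on distinct cells chosen uniformly at random (real bugs, as noted above, are not that random). The chance that at least one of k pegs hits one of b bugs on a board of N cells is then 1 - C(N-b, k) / C(N, k). The function name and the numbers below are hypothetical.

```python
from math import comb

def chance_of_a_hit(cells, bugs, pegs):
    """Probability that at least one of `pegs` distinct random cells holds a bug."""
    if pegs > cells - bugs:
        return 1.0  # more pegs than bug-free cells: a hit is guaranteed
    # Complement of the chance that every peg lands on a bug-free cell.
    return 1 - comb(cells - bugs, pegs) / comb(cells, pegs)

# Same five bugs and the same 200 pegs, on the full board and on a board
# narrowed to a tenth of the size by working out what the change touched.
full_board = chance_of_a_hit(cells=10_000, bugs=5, pegs=200)
small_board = chance_of_a_hit(cells=1_000, bugs=5, pegs=200)
print(f"full board: {full_board:.1%}, narrowed board: {small_board:.1%}")
```

With the peg count held fixed, the narrowed board gives a far better chance of finding something, which is the point of drawing lines around the impact of a change.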


SUMMARY

There are lots of things we can learn from studying an analogy. They allow us to morph the problem into some other perspective, and then explore the relationships and consequences in some other way. Pushing testing alongside a board-game shows the importance of being responsive in the testing process, and not just trying to issue fixed patterns across a huge space. A static approach to testing fails repeatedly and is more expensive. A dynamic approach is a more likely way to utilize the available resources.

Clearly shrinking the board but not heavily reducing the pegs will increase the likelihood of finding problems. Also, not just retesting all of the old spots will help to find new hidden dangers. Understanding the topography of the functionality will help in highlighting areas that need to be specifically addressed. There are lots of strategies for making better use of the resources.

If you see the life of a product as a long series of games against the code, you get an awfully large number of chances to test for underlying problems. If you squander that on a massive stack of simple-but-useless regression tests that cover the same unchanging territory again and again, it is no wonder that the product keeps getting released with a significantly large number of problems. Time and resources are not making headway into the real underlying territory.

If you craft tests as absolute sets of instructions, then recrafting them for new games is hard and a huge amount of work. Since we hate testing, we shy away from any work in that area. As such, even projects that do everything 'correctly' and have the 'proper' suites of regression tests and always test 'before' each and every release, let large quantities of bugs through. In fact, the ones that followed the established best practices are probably of average or worse quality because the programmers tend to shift their minds into neutral during the testing phase. It is the guys with no resources, anxious about not wasting what little effort they have, that are getting the most bang for their buck.

Tuesday, March 11, 2008

The Science of Information

My last couple of blog entries have -- once again -- been procrastinations from finishing off a difficult post. Most often, my difficulties in completing a piece come from uncertainties about the essence of what I am saying. Probably a good thing, as I am often trying to stretch my understanding to a new level.

In this case, since what I'm talking about is what I think can be, there is no foundation for me to fall back on. I can't check my memories, or facts, or even look to other people for help. I have a vague outline of what I really want to say, but it's not entirely as certain as the words I used to describe it.

Because of that, I would suggest reading this one lightly; it may prove to be substantial, but then again it may not. Of course, comments are welcome, and any additional thoughts are always appreciated.


CHANGING TIMES

The 20th century tore a hole right through Newton's deterministic world. On impact, his simple discrete model of cause and effect in a linear existence seemed to shatter into a million pieces.

Our advancements were amazing. The general theory of relativity redefined how we saw geometry. We no longer live in a flat space; instead it is curving and bending around us. Quantum physics shifted our foundations, basing them on probability rather than actual matter. Things are now so little that we can only guess where they are. Chaos theory showed us that relationships aren't necessarily linear; small things can have huge unexpected effects. This gave us reasonable explanations for phenomena like weather, but left us unable to predict it. And fractals revealed the naivete of our calculus by opening it up to allow recursive, constantly morphing functions that are infinitely deep. Now we can relate the complexity of detail back to our mathematics, as more than just a simple subset of common shapes.

Our models of the world around us have taken huge leaps. In this, for all that we lost in realizing the world is not a simple deterministic place, we gained it back again in understanding complex behaviors like fluid motion, weather patterns or other previously unexplainable material events. The world around us may not be simple, but it is still possible to understand it.

In one of my earlier blog entries, "The Age of Clarity", I predicted that we would pass through a new Renaissance driven by our increased understanding of the information around us. To some degree, I suspect that many people will assume that I am falling back into thinking that the world is simple and deterministic; that this new age will allow us to model it conclusively with sciences similar to classic physics.

From our vast increases in knowledge I am certainly aware of the dynamic nature of the world around us, but that does not stop me from envisioning a discipline that can fully identify and integrate these behaviors into our world. Just not deterministically. All things, as we know them, leave an imprint on the fabric of space and time around us. We are surrounded by vast amounts of information; we can record it and understand its structure and its inter-relationships, even if we cannot entirely project how it changes in the future. We can project some changes, particularly large ones, even if we cannot align to them over time. We know more than we realize, but we understand less than we know.


MONITORING CHANGE

At some point in our not-too-distant future, I would expect that we will easily be able to collect any relevant information for real-life things or events. We would be able to process this information in a way that would allow us to model the effect of at least large changes, with some extraordinary degree of precision. With this type of model, we will be able to understand how changes have positive and negative effects; we will also be able to determine whether things are getting better or worse.

More specifically, I mean that we should be able to monitor an organization like a government, measuring its important aspects. This we should be able to accomplish quickly, possibly spending only a couple of days setting up the process. From there we would be able to model the key variables, with a full and complete understanding that any 'non-monitored' variables are 'non-essential' and that their impact on the overall model is insignificant.

By virtue of our knowledge, we will know absolutely all of the variables that are significant for this type of model and all of the variables that are not significant. And we will be right, because what we are doing has rigour. We will be able to understand and collect the significant pieces of information that relate back to our real world. With this type of model, we can then exploratively make changes and see their effects.

As the government considers new laws and regulations, we should be able to alter the model to get an estimate of the effect of the changes. As the changes are made, we should be able to get a quantitative feedback that proves that the changes are having a positive effect. In the end, we should know, conclusively that we have in fact improved the government; not a guess or a gut feel, but solid evidence that things are better to some degree.

Now, that type of statement is easy in a deterministic world, but in ours, modeling an organization for instance and introducing changes, then recalculating a new direction to any accurate degree is nearly impossible using today's understandings. We need to guess at what information to capture, and then guess at how changes influence it. We don't have enough of an understanding of either, but that won't always be true.

Our goal then is to be able to measure the correct things, model them completely and then project these models forward to some previously unobtainable level of accuracy. All extremely hard tasks. With our current understandings they are nearly impossible and definitely not rigorously defined.

Our current ability, when we computerize things, is to spend years flailing away at the design of large systems. Often we guess about the structure of the data, and spend a great deal of our effort trying to slowly fix the serious problems caused by our poor initial guesses. Even when we have collected huge piles of data, we barely know how to go through and mine it for simple things. If we know what we are looking for it is easier, but still difficult. It takes years, often fails and is never particularly rigorous.

There are lots of available technologies, books, theories, etc. but none of them are fully objective. We kinda know how to do it, and it kinda works, but often fails.

Another aspect I expect for the future is that in assembling large piles of data, we will be able to decompose them into their elementary influences. Thus new and unexpected relationships will get discovered, just because of the amount of data we have collected, and the process of finding these relationships will be rudimentary. Once we understand the structure of data, we will also understand how it relates to other data in the world around us. Mining those relationships will be easy.

Our systems in the future will just present to us all of the causal relationships based on the increasing data pile size. We will no longer have to guess, and then go in search of the answers to back up our assumptions. Instead we will assemble the data, and the answers will be plainly visible (although it may still take us lots of time to accept them).


THE MISSING SCIENCE

To get there, we'll need at least one new science, if not many. For these goals we need a way to really understand information, including both its structure and its relationships.

One of our truly powerhouse accomplishments as an intellectual species has been the creation and refinement of mathematics. In a very real sense we can see it as being a rigorous way of relating numerical and symbolic qualities to each other, in an abstract container, such that the results are always universally correct. It's the absoluteness that is key.

We have a huge number of different 'branches' of mathematics, dealing with a wide range of axioms and theorems. Some are quite disparate from each other, but they all hang together because they are abstract, they are rigorous and they deal with numerical and symbolic entities.

If we stick our head directly into the abstract plane where mathematics hangs out and squint a bit, we can see other 'things' there as well.

We know 'information', and using information theory we know for instance that there is some undeniable minimum size for any piece of data. Claude Shannon proved that 'information' can only be compacted so far down, and no more. In that 'sense' information has a physical property in our world, even if we can't see it. Information is numbers and symbols, but it is also so much more. Some branches of mathematics do 'indirectly' deal with the relationships of data, but their focus is on the numerical or on the symbolic.
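As a rough illustration of that lower bound, here is a small sketch in Python (the language and the function name are choices made for this example, not anything from information theory itself). It computes the Shannon entropy of a message under a simplifying assumption: each character is treated as drawn independently from the message's own character frequencies. Under that model, no lossless encoding can average fewer bits per character than this number.

```python
import math
from collections import Counter

def entropy_bits_per_symbol(message: str) -> float:
    """Shannon entropy of the message's character frequencies, in bits.

    Assumption for this sketch: characters are independent of each other,
    so only their relative frequencies matter.
    """
    if not message:
        return 0.0
    total = len(message)
    counts = Counter(message)
    # Sum p * log2(1/p) over every distinct character.
    return sum((n / total) * math.log2(total / n) for n in counts.values())

def minimum_bits(message: str) -> float:
    """Lower bound on the total encoded size under the same assumption."""
    return entropy_bits_per_symbol(message) * len(message)
```

A message made of one repeated character carries no information per character in this model, while a message that uses four characters equally often needs two bits for each.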

We know we need a new science to apply rigour on top of our data, and we know that we want it to be an abstraction like mathematics, but dealing with information instead. I like the term 'informatics' as a name for this new science, but there is already an existing definition. The dictionary definition for informatics relates this term to Computer Science and to the study, storage, retrieval and classifying of information, however I would really like to dis-connect this term from any underlying hardware, software or even reality.

I'd prefer to recast this definition as the 'abstract' study of the structure and relationships of information, focusing on the shape of data and the interrelationships between data. Like numbers and symbols in mathematics, data may be markers for the real-world, but if we contain them within a rigorous abstract foundation, we can fully understand their nature prior to it becoming clouded by reality.


STRUCTURE AND RELATIONSHIPS

For everything we need a foundation; for this we want to study the abstract structure and relationships of data that mirror our real-world.

All information in our real world has a placeholder in any abstract one. We want to be able to conclusively say interesting things about the structure and relationships of the various placeholders. We want to be able to know how they are organized and see how they change over time.

Right now for instance, we may know some of the elements of how to model some complex real-world entities, such as bank accounts or financial instruments. A lot of work over the last 50 years has gone into gradually producing more reliable models. Interestingly, virtually none of it has been formal in any way. We may know a lot about creating abstract models for financial instruments for example, but this is experimental, proprietary, custom knowledge that is not shared.

The next step is to put structure and rigour to this knowledge. We don't just want a huge number of custom models, we want some unifying theory of how to reliably model the world around us. We need to dig into the foundations of information.

All information has a name. In fact most of it has many names. The names usually belong to complex taxonomies, and often have instance-related elements. For example, $128.02 is the price in CAD on June 5th, 2007 of a slice of a 1995 Canadian Govt Bond as traded in Toronto. BondPrice is a shorter name, and relative to things like market (Toronto) and date (June 5th).

Beyond that, a Bond is a type of financial instrument which is a type of contract, which is an agreement between two parties. So, we have both instance information and type information. For our purpose in this example we can mix the instance, type and categorical information, not distinguishing between the different varieties.

In an abstract sense, we can see the various breakdowns as being structural information about the underlying data. From some point, relative or absolute, we can construct a long-name, as a series of labels, that leads us through the structure and down to a specific datum.

The various parts of the long-name (price, date, instrument name, market) all form a pathway through the structure to describe the data. A collection of related pathways for similar data outlines the complete structure. An instance of price/yield information for a specific day, in a specific market is a data-structure. The static properties of the bond, its name, type, creation date, etc. all form a data-structure. These structures, plus many others form the structure of financial instruments. All data has some underlying structure that relates its various parts to each other.
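The long-name idea can be sketched in a few lines of Python. Everything here is a hypothetical layout invented for illustration: the nesting order, the label spellings and the helper names `resolve` and `pathways` are assumptions, and a real model of financial instruments would be far larger. The point is only that a series of labels walks the structure down to one datum, and that the set of all such pathways outlines the structure.

```python
from typing import Any, Iterator

# Hypothetical structure: instrument type -> instrument -> (static | market -> date) -> datum
instruments = {
    "bond": {
        "CanadaGovt1995": {
            "static": {"name": "1995 Canadian Govt Bond", "type": "bond"},
            "Toronto": {
                "2007-06-05": {"price": {"CAD": 128.02}},
            },
        },
    },
}

def resolve(structure: dict, long_name: tuple) -> Any:
    """Follow a series of labels down through the structure to one datum."""
    node = structure
    for label in long_name:
        node = node[label]  # a KeyError means the pathway does not exist
    return node

def pathways(structure: dict, prefix: tuple = ()) -> Iterator[tuple]:
    """Yield every complete long-name that ends at a datum."""
    for label, child in structure.items():
        if isinstance(child, dict):
            yield from pathways(child, prefix + (label,))
        else:
            yield prefix + (label,)

# The long-name for the price in the example above.
bond_price = ("bond", "CanadaGovt1995", "Toronto", "2007-06-05", "price", "CAD")
```

Calling `resolve(instruments, bond_price)` walks the six labels to the price, and `list(pathways(instruments))` enumerates the pathways that together describe this small structure.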

While the pathways may be relative, or change with language, science or other categorical process, the underlying type structure of information never changes. Its structure is its structure. The idea is that underneath the same basic thing exists whether we use French or English to describe it. It exists whether or not we describe it relative to ourselves or to some fixed point like the universe. Given two different descriptions of data, if they are close to complete, then they are close to having the same structure and inter-relationships.

That is a huge idea, the notion that although our way of representing data at the high level changes, the underlying structure of data is immutable. Fixed. It does not change, it exists in the world with a specific static structure. The world changes around it, but it is the same in any language, with a relative or absolute point of reference.

Oddly, the static-nature of the information is there by definition. If the structure of the data is different, then the thing being described is different. They might be related, and similar, but from a structural standpoint differences mean things are not the same. That does not imply that instances cannot vary from each other; that will always be the case. Dogs always have legs, usually four, and they come in a variety of colors and sizes. Variations are normal, changes in structure are not.

To utilize this, we need a way to objectively start with one datum, and use it to understand the rest of the structure of the information we want to capture. One way, for now, is to start assembling meta-models of all of the different data types. If we assemble models of financial instruments, social organizations, etc., we will be able to understand their existing structure. For every industry that we can model, we can be objective about the structure.

At some point however, understanding the structure should get beyond observation, particularly as we build up larger and larger abstractions for containing our world. We'd like a deterministic way of walking the information that by definition proved that the structure was actually correct. An algorithm for reliably mapping any point of data in a bigger picture.


MAPPINGS AND DIMENSIONS

All things are different to varying degrees. The strength in understanding any structure is being able to find isomorphisms and other mappings between the data at different levels. There are rather obvious mappings such as changes in language or terminology, and there are more exotic ones such as the similarity in structure between the events for different objects, or common patterns of data structures. Whatever type of mapping, it provides a way in which we can get a higher-level understanding of the underlying information. We always generalize to build up our knowledge, based on lots of individual bits.

For some types of information, the decomposition at some level is arbitrary. In that sense, it falls into some set of primitives at a specific level in the structure. Since, as we have seen, the underlying structure is immutable, for any higher-level set of primitives where there are multiple ways of expressing them, there must be a mapping between the different sets.

An easier way of saying this, is that if we express something in terms of AND, OR and NOT primitives, at the same level, NAND will work as well. Anything expressible in the one set of primitives is also expressible in the other. Because they are the same, there is a mapping between them (although not necessarily one-to-one or onto).
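The claim about the two primitive sets is small enough to check exhaustively. The sketch below, in Python with function names chosen only for this example, rebuilds NOT, AND and OR from NAND alone and then compares them against the built-in operators for every combination of inputs.

```python
from itertools import product

def nand(a: bool, b: bool) -> bool:
    return not (a and b)

# AND, OR and NOT rebuilt from NAND alone.
def not_(a: bool) -> bool:
    return nand(a, a)

def and_(a: bool, b: bool) -> bool:
    return not_(nand(a, b))

def or_(a: bool, b: bool) -> bool:
    return nand(not_(a), not_(b))

def same_coverage() -> bool:
    """Check every input combination against Python's built-in operators."""
    return all(
        not_(a) == (not a)
        and and_(a, b) == (a and b)
        and or_(a, b) == (a or b)
        for a, b in product([False, True], repeat=2)
    )
```

In this construction a single AND costs two NAND calls and a single OR costs three, which is one way of seeing that the mapping exists without being one-to-one.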

In a different language, for example, there might be 50 words that describe types of snow. The mapping from English to the other language will have to include descriptive verbs as well as the noun 'snow'. That the primitives decompose differently doesn't make the underlying thing different; it is just an alternative geometry for expressing it.

Different primitive sets will have different sizes, but still represent morphisms from one set to another. We can show the expressiveness, and the coverage of each of the primitives and prove that the coverage of one set is identical to the coverage of another, thus the two are essentially the same. Thus information broken down into one set of primitives can be mapped to some other set of equivalent primitives.

Besides primitives, there are some interesting mappings that occur because all of the information in our real world exists in four dimensions. In our language we use nouns and verbs to describe the things in our space.

Nouns tend to be tied to objects sitting in our 3D space of geometry, while verbs, being actions, are more often tied to our 4th dimension, time. We also relate our 'business' problems in terms of nouns and verbs. We can, by definition, discuss all aspects of business using our verbal language. Putting structure directly to our communication decreases the likelihood of misinterpreting it.

Our language itself underneath is tied to the structure of our physical existence. That is an expected consequence because language, in a very real sense, is just another space on which to bind real-world things onto abstract concepts such as words and sentences. Languages are loosely-structured, ambiguous, non-rigorous models of the world around us, but they are still models.

Similar mappings are important, but they are not the only class of mappings that have interest or value. Knowing as we do, that much of our world is 'fractal-based', we need some less rigid concept for mappings that are similar but not truly isomorphic, possibly a term like 'fractal-morphic', with a meaning of a twisted or skewed mapping. The scope of the mappings is complete, e.g. both sides are the same size, but the relationships are transformed with respect to one or more dimensions. A sense of degree would help refine the concept, so we could say something was 2-degree fractal-morphic to something else, meaning that it was skewed in no more than 2 dimensions.

A corner-stone of any understanding of information will be creating rigorous definitions for these different types of similarities. If we could for instance boil the world down into a few dozen or so major meta-structures, then it becomes very easy to determine the properties of the data. What we know in general can help us to deal with the specific.


A SIMPLE HIERARCHY

Stepping back up to the higher-level, we have two abstract bases of theory. One specializes in numerical and symbolic qualities: mathematics, and the other specializes in structural and relationship attributes of information. As pieces, these two then fit into each other quite nicely.

With the various branches of mathematics we manipulate numbers or symbols in numerical ways. It is possible for one to believe that numbers form the base layer of all things, so that mathematics is and will always be the primary abstract field. However, where we can find mathematical expressions for the structure and relationships, we can also find structural and relationship expressions for the mathematics.

The structure and relationships of information is similar, but in many ways my expectation is that it supersedes mathematics. In a way, mathematics is the concrete 'relationships' between the abstract things in an abstract place, so the relative structure and abstract relationships should be a higher level of abstraction.

Regardless of where it fits in, the two key properties for this new science of information are that it is abstract and that it is rigorous. Both are necessary to position it for maximum usability. The foundations need to be self-evident and universally true. It should not matter where you are, the basis is never changing. Even though we are dealing with 'real-world' information, the essence of what we are working with is totally abstract.

Some things will be tied to concrete entities, even though the science itself is abstract. Some things will not. The parallel to mathematics can be seen in simple things like 'complex' numbers. They do not exist in our world, yet they can be the roots of equations that do have physical bindings. In that same way, there will be meta-data that comes into play, but has no real world equivalent. That arbitrariness may seem confusing, but it is not uncommon to have to leave the base dimensions of a problem to find or quickly compute a solution; some linear programming algorithms, for example, do that.


SOCIAL FACTORS

Some of what I imagine possible is the ability to better understand our social organizations. There are consequences to owning this knowledge.

The 20th century saw tremendous strides towards making manufacturing consistent. The idea that we can monitor, and control for quality the physical output of a lot of people, in an almost mechanised way made a huge difference in manufacturing. The modernization of factories was significant. It caused upheaval, but ultimately our societies integrated the changes.

The revolution was stunning, but it was really confined to the way we saw physical activities. One of the things we can't do is apply those rules towards intellectual endeavours. We learned how to consistently build a Model T, but not how to consistently fill out a form or run a meeting.

Although using muscles is different than using the brain, the two share a relationship to each other. The biggest difference being that intellectual effort can't be visibly measured. This seemingly makes it an intangible quality like confidence or morale. Yet we know it is there and we know the output of it. We also know for instance that bad morale reduces physical and mental output, so it in itself is not really intangible. We just don't know how to measure it properly.

One of the things we should be able to do with our advanced understandings is see these things based on their imprints from the world around us. In large organizations like governments, we should be able to determine how effective the organization is, and how many problems it has. We should be able to see the effects of 'intangible' qualities, and see how changing simple things improves or degenerates the circumstances. Intangible things may not be physical, but by definition they have some sort of an effect, or they are imaginary.

Some day we will be able to model these qualities. To get there we need to cross over the point where we understand how to mechanize intellectual effort because we need to relate it back to some type of underlying model. This is the fear of all white collar workers, that we can measure their work. But until we can, we can't tell for instance how effective a social policy is. It is a horrible trade-off, but a necessary one.

It is the dream of the executroids, and the nightmare of the workers. But we must remember that factory-workers survived; so shall the white-collar workers and middle management.


SUMMARY

At some point, with each new discovery we shed light on a new and smaller, more refined set of details. Then over time as more details emerge we often coalesce our knowledge back into larger abstract theories. In a real sense we see the trees first, then the forest. We are forever in a cycle of slowly building up our knowledge.

It is amazing the things we know, but it is also amazing how we cannot truly verify them. Every day we are bombarded by facts and figures based on bad science and dubious mathematics. Underneath all of this is our use of computers to amplify the amount of data we are collecting and mis-interpreting.

We need help in quickly identifying bad information. You could spend time in a study for instance, to prove that the process was flawed, but it would be far easier just to discount it because it failed to use proper information gathering techniques. Since we don't have those specified, we have to go to the results on a line-by-line basis even though we can easily suspect that the problem was the initial data gathering process.

Because of this, too much that should be objective has been rendered subjective. We cannot be sure of its validity, and now often different groups are openly contradicting each other. We've fully fallen into the age of mis-information, because we have easy ways of building up large piles of information but no easy ways of validating that those piles actually match reality in any shape or form.

It was computers that created this problem, so it will have to be computers that get us out of it in the end. Once we understand how to understand information, we can then go about recollecting it in ways that will really provide us with true underlying knowledge. With enough of an understanding, much of what is now very murky, will come clear.

We lack an understanding of our information, but now at least we know what it is that is missing. If our next step is to create a real science of information, one that is undeniably and universally true, then we can give ourselves the tools we need to really make sense of the world around us. Realistically it is not that complicated to get a Renaissance, if you can figure out that you are a pre-Renaissance culture. It is accepting where you are that is hard.

Tuesday, March 4, 2008

In Object-Orientation

"Familiarity breeds contempt" is a common cliche. It sums up an overall attitude, but for technology I don't think it is that simple.

I am certainly aware that the more we work with something, the more we come to really understand its weaknesses. To the same degree, we tend to overlook the most familiar flaws just because we don't want to admit their existence. We tend towards self-imposed blindness, right up to the instant before we replace the old with the new. It was perfect last week; this week it is legacy.

Software development has once again come to that state where we are searching for the next great thing. We are at a familiar set of cross-roads, one that will possibly take us to the next level.

In the past, as we have evolved, we have also dumped out "the baby with the bath-water", so to speak. We'd go two steps forward, and then one and three-quarters of a step back. Progress has been slow and difficult. Each new generation of programmers has rebuilt the same underlying foundations, often with only the slightest of improvements, and most of the old familiar problems.

To get around this, I think we need to be more honest about our technologies. Sure, they sometimes work and have great attributes, but ultimately we should not be afraid to explore their dark sides as well. All things have flaws, and technology often has more than most. In that, nothing should really be unassailable. If we close our minds, and pretend that it is perfect, we'll make very little progress, if any.

For this blog entry, I wanted to look at the Object-Oriented family of programming philosophies. They have a long and distinguished history, and have become a significant programming paradigm. Often, possibly because of their history, most younger programmers just assume that they work for everything, and that they are the only valid approach to programming. They clearly excel in some areas, but they are not as well rounded as most people want to believe.


EVOLUTION

Although it is hard to establish, the principles of Object-Oriented (OO) programming appear to be based around Abstract Data Types (ADT). The formalization of ADTs comes in and around the genesis of the first Object-Oriented programming language, Simula, in 1962. In particular, The Art of Computer Programming by Donald Knuth, which explores the then commonly known data structures, was also started in 1962, although the first volume wasn't published until 1968. Certainly, whatever the initial relationship between the two, they are very close, yet they are very different.

Abstract Data Types as a movement is about using data structures as the basis of the code. Around these data structures, the programmer writes a series of primitive, atomic functions that are restricted to just accessing the structure. This is very similar to Object-Orientation, except that in ADTs it is a "philosophy of implementation" -- completely language independent -- while in OO it is a fundamental abstraction buried deep within the programming language.

One can easily see that OO is the extension of the ADT ideas into the syntax and semantics of the programming languages. The main difference being that ADTs are a style of programming, while OO is a family of language abstractions.

Presumably, by embedding the ideas into the language, it makes it more difficult for the programmers to create unstructured spaghetti code. The language pushes the programmer towards writing better structured code, making it more awkward to do the wrong thing. The compiler or interpreter assists the programmer in preventing bad structure.

While ADTs are similar to OO -- you set up a data-type and then build some access functions to it -- at a higher level there is no specific philosophy. The philosophy only covers the low-level data structures, but absolutely nothing is said about the rest of the code. Practice, however, is to use structured code to lay out each of the high-level algorithms that are used to drive the functionality.

This means there is a huge difference between ADTs and OO. Outside of the data structures, you can fall back into pure procedural style programming in ADTs. The result is that ADT style programming looks similar to Object-Oriented at the low-level, but is really structured as 'algorithms' at the high-level. A well-structured ADT program consists of a number of data-structures, the access functions and a series of higher-level algorithms that work the overall logic of the system. Since data drives most of the logic in most programs, the non-data parts of the code are usually control loops, interfaces to dispatch functionality or glue code to interface between different control loops. Whichever way, they can be coded in the simplest most obvious fashion, since there are no structural requirements.
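To make the split concrete, here is a minimal sketch of the ADT style in Python. The queue, its function names and the `run_jobs` loop are all invented for this illustration. The lower half is a data structure that is only ever touched through its small access functions; the upper half is a plain procedural control loop with no structural requirements at all.

```python
from collections import deque

# --- The ADT: a data structure plus its primitive access functions ---

def queue_new() -> dict:
    return {"items": deque()}

def queue_put(queue: dict, item) -> None:
    queue["items"].append(item)

def queue_get(queue: dict):
    return queue["items"].popleft()

def queue_is_empty(queue: dict) -> bool:
    return not queue["items"]

# --- The high-level algorithm: an ordinary control loop ---

def run_jobs(jobs: list) -> list:
    """Run (name, function, argument) jobs in order and collect the results."""
    pending = queue_new()
    for job in jobs:
        queue_put(pending, job)

    results = []
    while not queue_is_empty(pending):
        name, func, arg = queue_get(pending)
        results.append((name, func(arg)))
    return results
```

Nothing in the language stops `run_jobs` from reaching into `pending["items"]` directly; staying behind the access functions is purely a discipline, which is the sense in which ADTs are a philosophy of implementation rather than a language feature.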

That ambiguity in defining the style of the code at a high level in ADTs is very important. This means that there is a natural separation between the data and the algorithms, and they get coded slightly differently. But we'll get back to this later.


CODING SECRETS

Software development often hinges on simplifications, so it is not surprising that we really only want one consistent abstraction in our programming languages.

Because of this, we spend a lot of time arguing about which approach is better: a fixed language, that is strongly-typed or one that is loosely-typed. We spend a lot of time arguing about syntax, and a lot of time arguing about the basic semantics. Mostly we spend a lot of time arguing about whether or not the language should be flexible and loose, or restricted and strict. It is a central issue in most programming discussions.

If you come to an impasse enough times, sooner or later you need to examine why you keep returning to the same spot. Truthfully, we work in two different levels with our implementations. At the higher level, the instructions are more important. At the low level it is the data.

At the low level we want to make sure the data that we are assembling is 'exactly' what we want.

When I was building a PDF rendering engine in Perl, for example, I added in an extra layer of complexity. The engine was designed to build up a complicated page as a data-structure and then traverse it, printing each element out to a file. This type of processing is really simple in a strongly typed language, but can get really messy in a loosely typed one. To get around Perl's loose semantics I wrapped each piece of data in a small hash table with an explicit type. With the addition of a couple of really simple access functions, this had the great quality of emulating a strongly typed syntax, both making the code easier to write and guaranteeing fewer errors. It was a very simple structuring that also made it really easy to extend the original code.
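The original Perl is not shown here, so the following is only a sketch of the same idea in Python, another loosely typed language. The element kinds, the function names and the output lines are all hypothetical; in particular the output is a made-up text format, not real PDF operators. Each piece of data is wrapped in a small dictionary carrying an explicit type tag, and a simple access function refuses to hand back a value under the wrong type.

```python
def make(type_name: str, value) -> dict:
    """Wrap a value in a small dict that carries an explicit type tag."""
    return {"type": type_name, "value": value}

def value_of(item: dict, expected_type: str):
    """Access function: return the value only if the tag matches."""
    if item["type"] != expected_type:
        raise TypeError(f"expected {expected_type}, got {item['type']}")
    return item["value"]

def render(item: dict) -> list:
    """Traverse a page structure, producing one output line per element."""
    kind = item["type"]
    if kind == "text":
        return [f"TEXT {value_of(item, 'text')}"]
    if kind == "line":
        x1, y1, x2, y2 = value_of(item, "line")
        return [f"LINE {x1} {y1} {x2} {y2}"]
    if kind == "page":
        lines = []
        for child in value_of(item, "page"):
            lines.extend(render(child))
        return lines
    raise TypeError(f"unknown element type: {kind}")
```

Extending the renderer means adding one more tagged kind and one more branch, and a mismatched element fails loudly at the access function instead of quietly producing garbage in the output file.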

At the higher level we want to focus on the instructions and their order; the data is not important. Batch languages and text processing tools are usually loosely typed because it makes more sense. The data is really minimally important, but the order of the functions is critical.

As another example, I was building a dynamic interface engine in Java. In this case, I wasn't really interested in what the data was, only that at the last moment it was getting converted into whatever data-type I needed for display or storage. This type of problem is usually trivial in a loosely-typed language, so in Java, I added in a class to 'partially' loosely type the data. Effectively, I was concerned with 'removing' via a single call, any of the attributes of the type of the data. It does not matter what the data is, only what it should be. Again it is a simple structuring, but one that effectively eliminated a huge amount of nearly redundant code and convoluted error handling.
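The Java class itself is not shown, so this is only an analogue, written in Python to stay consistent with the other sketches here. The class name `Loose`, its methods and its conversion rules are assumptions made for illustration. The value is carried around untouched, and the caller states what it should be only at the moment it is needed for display or storage.

```python
class Loose:
    """Holds a value of any type; the caller says what it should be."""

    def __init__(self, raw):
        self.raw = raw

    def as_str(self) -> str:
        return "" if self.raw is None else str(self.raw)

    def as_int(self, default: int = 0) -> int:
        # Go through float so that "3.9" and 3.9 both truncate to 3.
        try:
            return int(float(self.as_str().strip()))
        except (ValueError, OverflowError):
            return default

    def as_bool(self) -> bool:
        return self.as_str().strip().lower() in ("1", "true", "yes", "on")
```

Code that passes these wrappers around never has to branch on the incoming type, which is where the redundant conversion code and convoluted error handling would otherwise accumulate.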

With enough coding experience in different languages, you come to realize that being strict at a low-level is a great quality, but being flexible at the high level is also important. It is easy enough to get around the bias of the language, but it does count as some additional complexity that might prove confusing to other developers. It is no wonder that the various arguments about which type of language is superior are never conclusively put to rest; it is more than a simple trade-off. There is an inherent duality related to the depth of the programming. The arguments for and against, then, are not really subjective as much as they are just trying to compare two completely different things.

Well-rounded implementations need both strong-typing and loose-typing. It matches how we see the solutions.


INTERNAL MODELS

Another fundamental problem we face, comes from not realizing that there are actually two different models of the solution at work in most pieces of software.

The way a user needs to work in the problem domain is not a simple transformation from the way the data is naturally structured. The two models often don't even have a simple one-to-one mapping; instead the user model is an internal abstraction that is convenient for the users, while the data model is a detail-oriented format that needs to be consistent and strictly checked.

A classic example is from a discussion between Jim Coplien and Bob Martin:

http://www.infoq.com/interviews/coplien-martin-tdd

Specifically Jim said:

"It's not like the savings account is some money sitting on the shelf on a bank somewhere, even though that is the user perspective, and you've just got to know that there are these relatively intricate structures in the foundations of a banking system to support the tax people and the actuaries and all these other folks, that you can't get to in an incremental way."

In that part of the conversation Jim talked about how we see our interaction with the bank as a savings account, but underneath due to regulatory concerns the real model is far more complex. This example does a great job of outlining the 'users' perspective of the software, as opposed to the underlying structure of the data. Depending on the usage, a bank's customers deal with the bank tellers operating the software. The customer's view of their accounts is the same one that the system's users -- the tellers -- need, even if underneath, the administrators and regulators in the bank have completely different perspectives on the data.

While these views need to be tied to each other, they need not be simple or even rational. The user's model of the system is from their own perspective, and they expect -- rightly so -- that the computer should simplify their view-point. The computer is a powerful tool, and one of its strengths is to be able to 'transform' the model into something simpler allowing the user to more easily interact with it, and then transform it back to something that can be stored and later mined.
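A toy version of the two models might look like the sketch below. It is written in Python, and every name in it is hypothetical: the ledger layout, the `bank:cash` account and the two-row deposit are assumptions made for illustration, not a description of how any real banking system is built. The data model is a flat list of postings; the user model is a single account with a balance.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Posting:
    """Data model: one row in a hypothetical ledger."""
    account: str
    cents: int   # positive adds to the account, negative takes away
    memo: str

@dataclass
class SavingsView:
    """User model: what the teller and the customer see."""
    account: str
    balance: str

def to_user_view(ledger: list, account: str) -> SavingsView:
    """Collapse many postings into the one number the user cares about."""
    total = sum(p.cents for p in ledger if p.account == account)
    return SavingsView(account, f"${total / 100:.2f}")

def deposit(ledger: list, account: str, cents: int) -> list:
    """One user action becomes two rows in the data model."""
    return ledger + [
        Posting(account, cents, "deposit"),
        Posting("bank:cash", -cents, f"cash received for {account}"),
    ]
```

Even at this size the mapping is lopsided: one deposit on the screen is two rows underneath, and one balance on the screen is a sum over an unbounded number of rows. A form generated straight from `Posting` would show the user the wrong model.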

It is a very common problem for many systems to have the software developers insist that there is only one 'true' way to look at the data: the way it needs to be structured in the database. It is not uncommon to see forms tools, generators or other coding philosophies that are built directly on the underlying data model.

This makes it possible to generate code and it cuts down on the required work, but rarely is the data model structured in the same way as the user needs. The results are expected. The users can't internally map the data to their perspective, so the applications are extremely awkward.

A single model type of design works only for very simple applications where there is little mapping between the user model and the data model. Mostly that is sample applications and very simple functions. While there are some applications of this sort, most domain specific systems require a significant degree of complex mapping, often maintaining state, something the computer can and should be able to do.

The essence of this is that the 'users' have a model that is independent of the 'data' model. That is, the interface translates between the real underlying structure of the data, and the perspective that the user 'needs' to complete their work. More importantly, the models are not one-to-one, or we would be able to build effective translators. This duality and inconsistent mapping is why code generators and forms tools don't work. You can't build a usable application from the data model, because it is only loosely tied to the user model.


OBJECT ORIENTED

So what does this have to do with Object-Oriented programming?

The structure of an Object-Oriented language makes it relatively easy to model the structure of the data in the system. At the lower data level, the semantics of the language allow it to be used to build complex systems. You create all the objects that are needed to model the data, or the user abstraction. Object-Oriented languages are at their height when they are encoding easily structured objects into the language.

The classic example is a GUI system where the objects in the system are mapped directly to the visual objects on the screen. That type of mapping means that when there are problems, it is really easy to go back to the screen and find the errors. The key here is the one-to-one link between the visual elements and the objects. The point of going to a lot of extra effort is to make it easy to find and fix problems. You pay for the extra effort in structuring the OO code by getting reduced effort in debugging it.

At the higher level, all of the OO languages totally fall apart. We add massive amounts of artificial complexity into the works just to be able to structure the problem in an Object-Oriented manner. Ultimately this makes the programs fragile and prone to breakage. Ideas such as inversion of control and dependency injection are counter-intuitive to the actual coding problems. They introduce too many moving parts.

Now, instead of the problem being obvious, you may need to spend a significant amount of time in a debugger tracing through essentially arbitrarily structured objects. Constructs like Design Patterns help, but the essence of the implementation has still moved far away from the real underlying problem. The developer is quickly buried in a heap of technical complexity and abstractions.

If what you wanted was a simple control loop, a traversal or some hooks into a series of functions, it would be nice if the simplicity of the requirement actually matched the simplicity of the code.
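As a rough illustration of that kind of match, here is a small Python sketch of a control loop with a hook, written as plain functions. The names run_steps, trim and shout are made up for this example; it is not meant as the one right way to do it, only as a picture of how little code the requirement actually needs.

```python
def run_steps(value, steps, on_step=None):
    """Pass a value through a series of functions, in order.

    on_step is an optional hook, called after each step with the
    step's name and the value it produced.
    """
    for step in steps:
        value = step(value)
        if on_step is not None:
            on_step(step.__name__, value)
    return value


# Two hypothetical steps, just to have something to run.
def trim(text):
    return text.strip()


def shout(text):
    return text.upper()


log = []
result = run_steps(
    "  hello ",
    [trim, shout],
    on_step=lambda name, value: log.append((name, value)),
)
print(result)
print(log)
```

If something goes wrong, there is one loop to read and one list of functions to check.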

Even more disconcerting, we must leave our primary language to deal with building and packaging problems. We use various other languages to really deal with the problems at the highest level, splitting the development problem domain into pieces. Fragmentation just adds more complexity.

Our build, package and deploy scripts depend on shell, ant or make. We could write the build mechanics in our higher-level language, but writing that type of build code in Java, for example, is so much more awkward than writing it in ant. If we see it that clearly for those higher-level problems, does it not seem just as obvious for the higher level of the rest of the system?

Another critical problem with the Object-Oriented model is the confusion between the user model, the data model and the persistent storage. We keep hoping that we can knock out a significant amount of the code if we can simplify this all down to one unique thing. The Object-Oriented approach pushes us towards only one single representation of the data in the system. A noble goal, but only if it works.

The problem is that in heading in that direction, we tend to get stingy with the changes. For example, the first version of the application gets built with one big internal model. It works, but the users quickly find it awkward and start asking for changes. At different depths in the code, the two models start to appear, but they are so overlapped it is impossible to separate them. As integration concerns grow, because more people want access to the data, the persistence model starts to drift away as well.

Pretty soon there are three distinct ways of looking at the underlying business logic, but nobody has realized it. Instead the code gets patched over and over again, often toggling between leaning towards one model and then back again towards another, creating an endless amount of work and allowing the inconsistencies to create an endless amount of bugs.

If the architecture identifies and encapsulates the different models from each other, the overall system is more stable. In fact the code is far easier to write. The only big question comes with deciding the real differences between the user model and the data model. When a conflict is discovered, how does one actually know whether the models should differ from each other or whether both are wrong?
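One way to picture that encapsulation is to keep the three representations apart and translate only at the boundaries. The Python sketch below is an assumption-laden illustration: the Account type, the tuple-of-strings storage row and the display dictionary are all invented here, and it assumes non-negative balances for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Account:
    """Data model: the internal representation the logic works on."""
    account_id: int
    owner: str
    balance_cents: int


def from_storage(row):
    """Persistence -> data model. The stored row is a tuple of strings."""
    account_id, owner, balance_cents = row
    return Account(int(account_id), owner, int(balance_cents))


def to_storage(account):
    """Data model -> persistence."""
    return (str(account.account_id), account.owner, str(account.balance_cents))


def to_user_view(account):
    """Data model -> user model: labels and formatting the user expects.

    Assumes a non-negative balance (an assumption of this sketch).
    """
    dollars, cents = divmod(account.balance_cents, 100)
    return {"Owner": account.owner, "Balance": f"${dollars}.{cents:02d}"}


row = ("7", "Pat", "12345")
account = from_storage(row)
print(to_user_view(account))
print(to_storage(account) == row)
```

When the users want a different view, only to_user_view changes; when the storage format drifts, only the two storage functions change.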


SUMMARY

Ironically, even though they are probably older and less popular, ADTs were better equipped for handling many implementation problems. Not because of what they are, but because of what they are not. Their main weakness was that they were not enforced by the language, so it was easy to ignore them. On the plus side, they solved the same low-level problem that Object-Oriented did, but allowed for better structuring at the higher level. It was just that it was up to the self-discipline of the programmer to implement it correctly.
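For readers who have not worked in this style, here is a minimal sketch of an ADT in Python: a stack defined only by its operations. The function names are hypothetical. Note that nothing in the language stops a caller from reaching into the underlying list directly, which is the lack of enforcement mentioned above.

```python
def stack_new():
    """Create an empty stack. The list is an implementation detail."""
    return []


def stack_push(stack, item):
    """Put an item on top of the stack."""
    stack.append(item)


def stack_pop(stack):
    """Remove and return the top item; fail loudly if there is none."""
    if not stack:
        raise IndexError("pop from empty stack")
    return stack.pop()


def stack_is_empty(stack):
    """Report whether the stack holds no items."""
    return not stack


s = stack_new()
stack_push(s, "first")
stack_push(s, "second")
print(stack_pop(s))
print(stack_is_empty(s))
```

The higher-level code that uses the stack can be structured however the problem demands; only the discipline of going through these four functions has to be kept.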

I built a huge commercial system in Perl, which has rudimentary Object-Oriented support. At first I tried to utilize the OO features, but rapidly I fell back into a pure ADT style. The reason was that I found the code was way simpler. In OO we spend too much time 'jamming' the code into the 'right' way. That mismatch makes for fragile code with lots of bugs. If you want to increase your quality, you have to start with trying to get the 'computer' to do more work for you, and then you have to make it more obvious where the problems lie. If you can scan the code and see the bugs, it is far easier to work with.

We keep looking for the 'perfect' language: the one-size-fits-all idea that covers both the high and low level aspects of programming. Not surprisingly, because they are essentially two different problems, we've not been able to find something that covers the problem entirely. Instead of admitting to our problems, we prefer to get caught up in endless arguments about which partial solution is better than the others.

The good qualities of ADTs ended up going into the Object-Oriented language design, but the cost of getting one consistent way of doing things was a paradigm where it is very easy to end up with very fragile high-level construction. Perhaps that's why C++ was so widely adopted. While it allowed OO design, it could still function as a non-OO language, allowing a more natural way to express the solution. Used with caution, this allows programmers to avoid forcing the code into an artificial mechanism, not that any programmers who have relied on this would ever admit it.

The lesson here I guess is that the good things that make Object-Oriented popular, are also the same things that make it convoluted to some degree. If we know we have this duality, then for the next language we design for the masses, we should account for this. It is not particularly complicated to go back to ADTs and use them to redefine another newer paradigm, this time making it different at the low and high levels. Our own need for one consistent approach seems to be driving our technologies into becoming excessively complicated. We simplify for the wrong variables.

We don't want to pick the 'right' way to do it anymore, we want to pick the one that means we are most likely to get the tool built. Is a single consistent type mechanism better than actually making it easier to get working code?