Wednesday, December 26, 2007

A Silly Little Storm

This posting looks into a small and common problem, but one that frequently causes larger issues. Even the tiniest of problems, like a small rock in a stream, disrupts an increasingly larger segment of the water flowing behind it. The turbulence exceeds the sum of its causes as more rocks combine to throw off larger and larger ripples. With enough rocks you get rapids; the whole being much larger than its pieces.

Sometimes when I am commenting on other writers' blog entries I just want to end my comment with a little bit of shameless self-promotion. Usually my name and then a link back to my blog, to make it easy for readers to stop by and visit me:

Paul.
http://theprogrammersparadox.blogspot.com

These are my favorite last two lines in most comments.

Of course, to avoid mistyping the URL I generally go back to my blog's root page, highlight the URL, copy it and then paste that into the comment box a couple of times. Pasting decreases the likelihood of typos; usually a good thing.

Someone pointed out that in Blogger I should wrap the URL in an anchor tag so that it is easier for any readers to click on the link. No problem. That seems easy enough. The simplest way to accomplish this is to type in the starting tag '<a href=', paste in the URL, close it with '>'; then paste in the URL again -- to make it the visible text as well -- and finally add in the closing '</a>'. How hard is that?

In Blogger, when I go to the root page for my blog, I get an extra forward slash '/' character because it is at the 'top' level, and not pointing to any specific entry. If I selected a blog entry then the URL ends with the '.html' extension because it is directly referencing a file. But normally if I want the top level URL for my blog, I'll go to the main page and copy the URL with its extra forward slash at the end.

Funny enough, in Blogger there is this stupid little 'issue' where an anchor tag in the form of "<a href=XX />link</a>" -- with an extra '/' at the end of the open tag -- works perfectly in the preview, but is slightly modified when actually posted. In the preview the forward slash appears to be ignored; everything looks fine, which is no doubt browser-dependent behavior, and Firefox interprets the tag as expected.

But once you post the comment:

<a href=http://theprogrammersparadox.blogspot.com/>
http://theprogrammersparadox.blogspot.com/</a>

it gets turned into:

<A HREF="http://theprogrammersparadox.blogspot.com" REL="nofollow"/>

Visually the hyperlink text disappears and any of the following text becomes the link instead. As well, the newlines get turned into breaks and there are paragraph markers inserted at the beginning and end of each paragraph.

I am guessing, but when you actually post the comment it is sent to the database for storage. On the way in or out -- usually in -- the comment text is modified by some well-meaning, but obviously destructive code. That 'function' performs some modifications on the comment text; from above we see it added a REL attribute to the anchor tag; it also converted the tag to uppercase. Because it sees the anchor tag as unary, it also strips the now-orphaned closing '</a>' tag. It probably does lots of other stuff as well. The parsing for that function saw the closing forward slash and assumed that it was part of the HTML tag syntax, unlike the browser, which sees it as part of the earlier directory-path syntax. The browser's parsing behavior is described as 'greedy' because it tries for the longest possible token, which in this case is the full directory path with the appended forward slash. The internal function, on the other hand, is probably ignoring the directory-path syntax and is just looking directly for '<', '>' and any '/'s to modify. Since it isn't fully parsing the HTML, it doesn't have to understand it as well as the browser does.
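My guess at the two parsing styles can be sketched with a pair of toy regular expressions in Python. Neither is Blogger's or Firefox's actual code -- both regexes are my own invention -- but they illustrate how a greedy read of the unquoted attribute value keeps the trailing slash, while a naive "is this a unary tag?" read eats it:

```python
import re

# The posted comment, with the unquoted URL keeping its trailing slash.
comment = ('<a href=http://theprogrammersparadox.blogspot.com/>'
           'http://theprogrammersparadox.blogspot.com/</a>')

# Browser-style "greedy" read: an unquoted attribute value runs to the next
# whitespace or '>', so the trailing '/' stays part of the URL.
browser_url = re.match(r'<a href=([^\s>]*)>', comment).group(1)

# Naive sanitizer-style read: a '/' sitting just before '>' is taken to be
# tag syntax (a unary tag), so the URL silently loses its slash.
sanitizer_url = re.match(r'<a href=(.*?)/?>', comment).group(1)

print(browser_url)    # http://theprogrammersparadox.blogspot.com/
print(sanitizer_url)  # http://theprogrammersparadox.blogspot.com
```

The same eight characters of text yield two different URLs, and two different tag structures, depending entirely on which parser reads them.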

So, wham, bam, I copy the URL into the text a couple of times, check it in the preview and kapow, I post it. "Oops, what happened there?" Suddenly the text and the link looks very different from the preview. If you combine both of the above issues that simple little extra '/' picked up in the URL causes quite an annoying problem, but not one that is detectable in preview. Everything looks great until you actually post. A very embarrassing error.

You may have noticed that I didn't wrap the href argument in double quotes. That certainly would have fixed the problem, but why toss in an extra set of double quotes when the preview has already shown that not having them will work properly. It is just extra stuff that doesn't seem necessary. Dropping the trailing '/' works too. But the 'problem' isn't what is actually typed, the problem is the preview. It is incorrectly verifying that stuff is working, when clearly it is not.

The cause here, beyond my own sloppiness, starts with two different points in history where at least two different programmers assigned two different meanings to the same underlying forward slash character: one as the separator between directory names in a path string, and the other as a modifier to the tag syntax indicating a unary tag or an ending tag, depending on placement. These two distinct meanings 'overload' the character so that, when combined in this specific circumstance, the true meaning is 'nearly' ambiguous; one block of code sees it one way, while the other sees it differently. When we 'paste' together different 'idioms' that have been crafted over time, we often get impedance mismatches. Little blots of dangerous complexity.

Added to that, two different pieces of code are parsing the same HTML fragment very differently. However close Blogger's HTML parsing routine is to Firefox's, it is different enough to cause problems. I could try testing this in Internet Explorer, but it doesn't really matter to me who is right and who is not following the standard. Firefox is being greedy; Blogger is not. More likely, to greedily parse the directory path, Blogger's code would have to go to a huge amount of extra work, most of it an unnecessary level of complexity beyond what they actually need to manipulate the comments. They won't ever parse the HTML fragment correctly, because they do not need to go to that level of understanding.

What is important is that the same basic 'data' is being interpreted in multiple ways by multiple different programs, and that while it is not ambiguous, it might as well be, since no two coders can or will parse it identically. Partial parsing will always produce ambiguities. Overloading will too. The meaning of our data is so often relative to how we handle it in our code. We fail to separate the two, implicitly burying knowledge into some piece of code.

Topping it all off, some programmer implemented the preview function without fully understanding what he or she was doing. If there are 'translations' done in other parts of the system, then it is exactly these 'translations' that need to exist in the preview function. Without them, preview is useless. Incomplete. A 'mostly' preview function is a sloppy, dangerous featurette that mainly misleads people and diminishes the tool. In the end, had this been implemented correctly, none of the other issues would have mattered or been of any consequence. I would have seen the broken link in the preview and fixed it. Although earlier problems set up the circumstance, the 'buck' stops at the final and most obvious place.

We might quickly dismiss this type of issue as trivial -- perhaps I should just learn to quote my arguments correctly -- if it wasn't for the fact that this type of thing is standard all over our software. It is one hell of a turbulent stream of never-ending little issues that together keep this stuff from being trustworthy. I have the greatest mental tool known to mankind on my desktop, but I can't trust it to cut-and-paste properly? We have so many old legacy problems acting as rocks in the stream that our users are constantly drowning in rapids of our own making. And so often, all we ever do is tell them that they shouldn't have added that extra forward slash -- "everybody knows you can't have extra forward slashes" -- and then we wonder why so many people hate computers.

A related theory is that you can judge the overall complexity of a 'machine' by the size of the effects caused by changes to its small details. In a simple system, small changes have only a small impact, but as the overall complexity grows, chaos becomes more prominent, so that small issues have progressively larger and larger impacts. The inherent complexity of the problem leads us to a base change effect of a specific size, but with each increasing dose of artificial complexity those effects become magnified. In a big system, if you can draw some conclusions about the size of the initial problems vs. the size of their effects, then over time you might get some gauge on how quickly the artificial complexity is accelerating within the system. This matters if you are interested in how many more changes you can make to the existing system, without some type of major cleanup, before the changes become impossible to make. Once past some threshold, no significant change can be made to an overly complex system without causing severe and damaging side-effects.

UPDATE: I'm not even going to try to put words to the amount of frustration I just experienced trying to move this piece between Google Docs and Blogger. The formatting, anchor tags, are not happy; consider yourself lucky if you can even read this...

Wednesday, December 19, 2007

The Nature of Simple

The phone rang; it was so long ago that I hardly remember it. It all comes back in that hazy historic flashback where you're never really sure if those were the original facts or just ones that got inserted later.

It was another support call. I was the co-op student back then; one of my duties was to handle the very low volume of incoming calls. For some software that might not be hard, but in this case, handling support for a symbolic algebra calculator was anything but easy.

This latest call was interesting. A Ph.D. candidate working on an economics degree was asking questions about simplifying complex equations. All in all, they were not hard questions to understand, not like the ones where I ended up struggling to explain how to relate some obscure branch of mathematics back into something that could be computed. Both ends of the question, the mathematics and the computer science, were often fraught with details and complexities of which I had no idea. An average question would send me back to the university, searching for anyone who could at least inform me of the basics. In those days, we didn't have Wikipedia, and certainly no World Wide Web. The Internet existed, but mostly as universities connected together with newsgroups and FTP sites. If you wanted to find out about something, that meant hitting the library and scouring through books the old-fashioned way. So much easier today. But it does take some of the fun out of it.

The core of this specific set of economics-related questions had to do with simplifying several complex microeconomic models so that the results could be used in a textbook. The idea was to use the symbolic algebra calculator to nicely arrange the formulas to be readable. The alternative was a huge amount of manual calculations that opened up the possibility that they might be handled incorrectly. Mistakes could creep into the manipulations. Automating it was preferable.

Once the equations were inputted into the calculator, the user could apply a 'simplify' function. It was fairly easy. The problem that prompted the phone call, however, was that the simplified results were not the results that were expected.

In the software, there were lots of parameters that the user could control, and some nifty ways to get into the depths to alter the overall behavior, but time has washed all of those details from my mind. What I do remember is that after struggling with it for an extended period of time, the grad student and I came to realize that what the computer intrinsically meant by 'simplifying' was not really the same as what any 'human' meant by it. We -- through some less-than-perfect process in our thinking -- prefer things to not be rigorously simplified. Instead, we like something that is close, but not exactly simple. Our rules are a little arbitrary; the algorithm in the software wasn't.
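A toy illustration of why 'simplest' is slippery: take two algebraically equivalent forms of the same expression, and each one wins under a different measure. (The expressions and the measures here are my own invention, not the calculator's actual rules.)

```python
# Two equivalent forms of the same polynomial.
forms = ['(x + 1)**2', 'x**2 + 2*x + 1']

# 'Simplest' depends entirely on which measure you minimize.
by_length  = min(forms, key=len)                     # fewest characters
by_nesting = min(forms, key=lambda f: f.count('('))  # least parenthesization

print(by_length)   # (x + 1)**2
print(by_nesting)  # x**2 + 2*x + 1
```

The factored form is shorter; the expanded form has no nesting. A human calls one of them 'simpler' by taste; the machine can only call one of them simpler by a rule.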

Some things hit you right away, while others take years of slowly grinding away in the background before you really understand their significance. This one was slow; very slow. Simplification, you see, is just a set of rules to reduce something, but there is more to it than that. You simplify with respect to one and only one variable. There is a reason for that.

SIMPLIFICATION MADE EASY

In any comparison of multi-dimensional things, we cannot easily pin one up against the other. Our standard way of dealing with this in two dimensions, for example, is to take the "magnitude" of the vector representing some point on a Cartesian graph. That is a fancy way of saying we take the two dimensions and, using some formula -- in this case magnitude -- apply it to both variables and then 'project' the results onto a single dimension; one variable. In two dimensions we like to use the length of the vector, but that choice is somewhat arbitrary. Core to this understanding is that we are projecting our multiple variables down to one single measure, because that is the only way we can make an objective comparison between any two independent things. Only when we reach two discrete numbers -- a and b, x and y, 34 and 23 -- can we actually have a fully objective way of saying that one is larger than the other.

Without a measure, you cannot be objective, so you really don't know if you've simplified things or made them worse. In any multi-variable system of equations, you might use something like linear programming to find some 'optimal' points in the solution space, but you cannot apply a fixed simplification, at least not one that is mathematically rigorous. Finding some number of local minima is not the same as finding a single exact minimum value.

Simplification then is an operation that must always be done with 'respect' to a specific variable, even if that 'variable' is actually a 'projection' of multiple variables onto a single dimension. For everything that we simplify, it must be done with respect to a single value. For any multi-variable set, there must be a projection function.

Falling back outwards -- way away from mathematics -- we can then start to analyze things like the fact that I want to "simplify my life". On its own, the statement is meaningless, because we do not know with respect to which aspect of my life I would like things made easier. If I buy more appliances to reduce my working time -- say a food processor, for example -- I save time in chopping. But it comes at the expense of having to buy the machine, wash it afterward, and occasionally perform some type of maintenance to keep it in working order. It simplifies my life with respect to 'chopping time', but if you've ever tried to wash a food processor you may know that it takes nearly as long to get it clean as it does to actually chop something. For a 'single chop' it's not worth bothering with; only when there is a great deal of cutting does it really make a difference. If you used it every time you cooked, it would hardly be simpler.

If we look at other things, such as a computer program, underneath there are plenty of variables with respect to which we can simplify. Overall, the nature of our simplification depends on how we project some subset of these variables down onto a single measure. This leads to some interesting issues. For instance, the acronym KISS, meaning "keep it simple stupid", is used frequently as dogma by programmers to justify some of their coding techniques. One finds in practice that because the number of variables and the projection function are often undefined, it becomes very easy to claim that some 'reduction' is simpler -- which it is with respect to a very limited subset of variables -- when in fact it is not simpler with respect to some other projection based on a larger number of variables. We do this all of the time without realizing it.

FILES VS. FUNCTIONS

My favorite example of this comes from an older work experience on a VMS machine. Back in those days, we could place as many C functions as we wanted into individual files. One developer came up with the philosophy of sticking one, and only one, function into each file. This, he figured, was the 'simplest' way to handle the issue. A nice side effect was that you could list out all of the files in a directory, and because each file name was the same as its function name, this produced a catalog of all of the available functions. And so off he went.

At first, this worked well. In a small or even medium system, this type of 'simplified' approach can work, but it truly is simplified with respect to so few variables that we need to understand how quickly it goes wrong. At some point the tide turned, probably when the system had passed a hundred or so files, but by the time it got to 300, it was getting really ugly. On the older boxes, 300 entries in a directory scrolled hideously across the screen; you needed to start and stop the terminal paging in VMS just to see what was in the directory. That was a pain. Although many of the functions were related, the order in the 'dir' listing was not, so similar functions were hard to find together. All of this was originally for a common library, but as it got larger it became hard to actually know what was in the library. It became less and less likely that the other coders were using the routines in common. Thus the 'mechanics' of accessing the source became a barrier that prevented people from utilizing it, rapidly diminishing the usefulness of the work already put into it.

Oddly, the common library really didn't have that many things in common, not with respect to any library that we often see today. 300 functions that actually fit into about twelve packages is fairly small by most standards. But the 'simplification' of arranging the files increased the complexity along other variables, rendering the library useless because of its overall extreme amount of complexity.

By now, most developers would easily suggest just collapsing the 300 files into 12 that were ordered by related functions. With each file containing between 10 and 30 functions -- named for the general type of functions in the file, such as date routines -- navigating the directory and paging through a few files until you find the right function is easy. 12 files vs. 300 is a huge difference.
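The collapse from 300 files down to a dozen is really just grouping by category. A rough sketch (the function names here are hypothetical stand-ins, not the actual library's):

```python
# Hypothetical function names from a one-function-per-file library.
functions = ['date_add', 'date_diff', 'date_format',
             'str_trim', 'str_pad',
             'io_open', 'io_close']

# Regroup into one file per category, keyed by each name's prefix.
modules = {}
for name in functions:
    prefix = name.split('_', 1)[0]
    modules.setdefault(prefix, []).append(name)

for module, names in sorted(modules.items()):
    print(f'{module}.c holds {len(names)} functions')
```

Seven files become three; at the real library's scale, 300 become 12. The directory listing is now simple with respect to 'number of entries' instead of 'one lookup per function', which is the projection that actually matters once the catalog no longer fits on a screen.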

Clearly, the solution of combining the functions together into a smaller number of files is fairly obvious, but the original programmer refused to see it. He had "zoomed" himself into believing that one-function-per-file was the 'simplest' answer. He couldn't break out of that perspective. In fact, he stuck with that awkward arrangement right up to the end, even when he was clearly having doubts. He just couldn't bring himself to 'complicate' the code.

SIMPLE DRIVING

This post was driven by my seeing several similar examples lately. It is epidemic in software development that we believe we are simplifying something when we are actually not. This is not just a matter of humans liking simplifications that are less rigorous than mathematics. It is that people tend towards simplifying with respect to far too few variables, and then they willfully ignore the side-effects of that choice.

I frequently come across documentation where someone is bragging about how simple their solution is, when clearly it is not. That files-vs-functions problem rears its ugly head all over Computer Science in many different forms. Ultimately, lots of people need a better understanding of what it actually means to really simplify something. That way their code, their architectures, their documents, and all that they present will actually be better; not just simplified with respect to some artificially small set of variables, causing all of the other variables in the solution to become massively more complex.

When you go to simplify something, it is important to understand a) the variables and b) the projection function. If you don't, then you are simplifying essentially randomly. The expected results of that will be: random, of course. If you've spent time and effort to clean up your code, then having that come crashing down on you because you didn't see the whole problem is rather painful. Something to be avoided.

SIMPLICITY AND COMPLEXITY

To simplify something is actually quite a complex operation. To do it with many, many variables takes a great deal of processing. Given a universe where the possible variables are essentially infinite, it actually becomes an incredibly challenging problem. Yet many of us can do it easily in our minds; it is one of the mysteries of human intelligence. With that in mind, I saw this quote:

"Don’t EVER make the mistake that you can design something better than what you get from ruthless massively parallel trial-and-error with a feedback cycle. That's giving your intelligence much too much credit." -- Linus Torvalds.

To be fair, this quote was preceded by a discussion of how the human body is the most complex thing we know. The trial-and-error feedback referenced is the millions of years it took nature to gradually build up the mechanics of our bodies. We should note that the feedback cycle has left its mark in terms of artificial complexity, particularly in funny things like our tail bones. We did have tails, so it is not hard to understand why we have tail bones, but we don't have tails anymore. Also, the parameters of the feedback loop itself were subject to huge variations, as many environmental and geological factors constantly changed over the millions of years. Those disruptions have left significant artificial complexity in the design. We are far from simple.

But it is not unlikely that, as we gain more of an understanding of the underlying details, we will be able to visualize abstract models of the human body and all of its inner workings in our minds. After all, we already do that for many things. We work with lots of things at a specific higher level of abstraction so that we are not impaired by the overall full complexity of their essence.

Eventually, one day we will reach a point where we understand enough of the biomechanics that a single human could draft a completely new design for the human body. For that design, I fully expect that they would be able to eliminate most of the unnecessary buildup of complexity that nature had previously installed. I give our ability to think abstractly a great deal of credit. But more importantly, I give any system of variable-changing adaptations, such as evolution, even more credit in its ability to create artificial complexity.

Applying this type of thinking to things like software development can be interesting. Many people believe that a huge number of software developers working together on a project on a grand scale will -- through iteration, if done long enough -- arrive at a stage in their development that is superior to that of just one single guiding developer. Their underlying thinking is that some large-scale trial-and-error cycle could iterate down to a simplified form. If enough programmers work on it, long enough, then gradually over time they will simplify it to become the ultimate solution. A solution that will be better (simpler) than one that an individual could accomplish on their own. Leave it to a large enough group of people for long enough and they will produce the perfect answer. This, I think, is incorrect.

We know that with some large group of people, each individual will be 'simplifying' their portion of the overall system with respect to a different set of variables and a different projection function. Individual simplifications will collide and often undo each other. The interaction of many simplifications with respect to different variables is the buildup of complexity at the overlapping intersections. In short, the 'system', this trial-and-error feedback loop, will develop a large degree of artificial complexity deriving directly from every attempt at simplifying it.

It gets worse too. Unless the original was fairly close to the final simplified version, the group itself will drive the overall work randomly around the solution space. It is like stuffing hay into a burlap sack with lots of holes: the more you stuff in one side, the more falls out of the other holes. You will never win, and it will never get done. Such is the true nature of simplification. We've certainly seen this enough in practice; there are lots of examples out there in large corporations. In evolution, for instance, huge numbers of the earlier, less optimal models get left lying around. Nature isn't moving towards a simpler model; it's just creating lots of them that are very adaptive.

Oddly, a single human -- one who can squeeze a large enough number of variables into their brain and process them -- is the only real way we can find something that is close to the simplest version. But just one person. I once had a boss who adamantly claimed that "nothing good ever came from a committee", and in many ways, because of the 'blendedness' of the various concepts of each of the members, his statement rang very true. However, it just might be more appropriate to say "nothing simple ever came from a committee". If we want it to be simple, in our human space or the mathematical one, it needs the internal consistency of having sprung from one mind, and preferably at one time. Once it gets too big for a single human, it needs to be abstracted at multiple levels in order to fit properly. Strangely, our knowledge itself gets built up into many layers of abstractions in a never-ending feedback cycle, but one of which only human intelligence can take advantage.

FINAL SUMMARY

Ok, time to wrap it up. We talk a good talk about simplifying things, but fairly often we are actually unsure of what that really means underneath. It is easy to wear blinders or get one's head fixed on a specific subset simplification, but it is very hard to see it across the whole. In that way, the term itself: 'simplify' is very dangerous. We use it carelessly and incorrectly when we do not affix it to a specific variable. We do not simplify things, we simplify them with respect to a specific measure. Once that is understood, you see why the road to complexity is paved with so many attempts to simplify problems; it is an important key to understanding how to build reliable computer software programs.

My final thought is that if the number of variables available for simplifying something is infinite -- you can use one of those Turing-inspired proofs about creating meta-variables to show this -- then from a universal perspective, you could never get a 100% projection together. So this implies that ultimately you cannot absolutely simplify anything. There is no universal simplification. It just doesn't exist, possibly proving the statement "nothing is simple". Neat, huh?

Friday, December 14, 2007

Fixing your Development Environment

In any work environment there are always circumstances that are a little less than perfect. However, if your job is not completely awful, then there is still potential. You just have to alter your environment enough to make it closer to your desires.

Some people have the pessimistic view that they are simply peons; that it is all beyond their control. They must live with things the way they are. The truth, however, is that each one of us has a huge influence over our own working environment, whether or not we really understand it. For this entry, I want to examine some of the things that software developers can and should affect in their workplaces. We get out of our work what we put into it. Positive changes make work easier and make you feel better about the time you spend working.

Obviously, if you're not the boss you can't just leap out and make huge sweeping changes. That doesn't mean you won't have any effect; it just happens slowly. Patience and perseverance are necessary. Continually making gradual small changes is the key to success. We always need to pick our battles very carefully, fixing each problem one by one. Nothing changes if you give up, accept the problems and then live with them. All your potential goes into reinforcing the status quo instead of improving it.

Wherever I go, I always start with trying to correct those problems that are closest to me. I like to start nearby and work my way out; getting the simple things fixed quickly, then gradually tackle larger and larger issues. This keeps up a positive momentum of successful changes.

For any rogue software development environments, the first big problem usually encountered is missing or poor source code control. Given the number of good software packages available for handling this, it is still a surprise to see development shops that don't use anything. Honestly, if you are not using any source code control, you are not seriously programming; you're just playing around. It's not an optional tool. It's not a pain. It always pays for itself.

This issue is much closer to home than actually designing or coding. Chaos with handling source code is far worse than bad source code. Fix this first or pay heavily for it later.

Set up a working scheme that protects the code, makes it easy to build and handles any secondary issues like sharing one code base for multiple instances. For complex projects expect a complex setup. In some cases the source code control software already exists but is being used poorly. Clean it up; fix any of the control problems. For very large teams using several hierarchical pools at different levels may be the best solution. You should be getting stability, reliability and traceability from your setup, anything less should be fixed.

The next most significant problem with most development is automating the build, release and update procedures. Depending on the length of each iteration in the project, manual procedures can be very expensive. Too much time is wasted and the results are inconsistent. Generally these days, an IDE like NetBeans or Eclipse, or a tool like Ant or Maven, makes automating the build procedure simple and painless, but few developers move on to the other two major automations.

Packaging for releases and upgrading are big issues, which should be fully automated to make sure there are no silly problems slipping in at the last moment. Inconsistent releases are embarrassing. For clients updating their systems, nothing looks worse than a long series of hard to follow instructions; more than one step is too many.

One-button releases that give the users one-button upgrades are the gold standard of a development project. They are always possible, but few people invest enough effort to get there, choosing instead to waste more of their time dealing with manual inconsistencies. You sleep far better at night if there is less to worry about, and messing up the final bits in a manual release is a frequent enough problem. You just need to get beyond your fear of spending too much time on this type of automation; it does prove itself to be worth it in the end. At minimum, for each release, allocate a fixed amount of time -- say two weeks -- to automating the big three processes. Gradually, you'll get it working, but even if it's only at 80% it still pays off nicely.
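As a sketch of what the packaging half of a 'one-button' release might look like (a hypothetical script of my own, not any particular shop's process), the whole step can collapse into a single function that archives the source tree and records a checksum:

```python
import hashlib
import pathlib
import tarfile

def build_release(src_dir: str, version: str) -> str:
    """Package src_dir into a versioned tarball and write its checksum.

    A real one-button release would also run the tests, tag source
    control, and publish the artifact; this only shows the packaging.
    """
    archive = f'release-{version}.tar.gz'
    with tarfile.open(archive, 'w:gz') as tar:
        tar.add(src_dir, arcname=f'release-{version}')
    digest = hashlib.sha256(pathlib.Path(archive).read_bytes()).hexdigest()
    pathlib.Path(f'{archive}.sha256').write_text(f'{digest}  {archive}\n')
    return archive
```

Run on every release, the same steps happen in the same order with the same names, which is exactly the consistency that manual checklists fail to deliver.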

Moving out a little bit, we often find in most development positions that we get responsibility for a previously existing part of the system. This 'legacy' problem is a pain, but it doesn't have to be horrible. A great way to learn the code and fix it at the same time is to apply a series of non-destructive refactorings to it. Generally, one wants to compress the code somewhat, but also handle the functionality in the most consistent manner. If you find that there are three ways in the code to handle something, harmonize on one. Fixing the consistency is necessary, breeds familiarity and will save lots of pain later. Updating the comments is another way to get through the code, making positive but non-destructive changes.
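As a small, hypothetical example of harmonizing on one way: suppose the inherited code renders a user's name three different ways at three call sites. A non-destructive refactoring pulls the variants into a single helper and routes every call site through it, so there is exactly one place where the decision lives.

```python
# Before: three call sites each built the display name differently.
#   "%s, %s" % (last, first)
#   first + " " + last
#   last.upper() + "/" + first
#
# After: one agreed-upon helper; every call site is updated to use it.

def display_name(first, last):
    """The single, consistent way to render a user's name."""
    return "%s, %s" % (last, first)
```

The behavior at two of the three call sites changes slightly, which is why this kind of harmonization needs agreement first -- but afterwards there is only one place to look, and only one place to fix.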

The worst part of any development job is flailing away at bad code, no matter who authored it. So it is important to make sure that all of the code you can change is simple, clean, consistent and readable, and that any notes on understanding, fixing or extending it are self-contained. The time we put into code formatting is minimal compared to the time we spend agonizing over problems. The two balance each other out; you'll always do less debugging if you start with better initial code. This is always true; for any system; for any type of code. Debugging is a very important skill to have, but it is something to be avoided wherever possible because it is very time consuming with little payoff.

If you've managed to get your environment under control, your code is clean and it is working reasonably well, then it is time to move on to the system as a whole.

If your project is large enough, there are several developers, some of whom are having problems with their own development pieces. Assisting another developer without getting in their way is a very tricky skill, but another one worth learning. Most development problems stem from sloppiness, so if they are having trouble with their logic, the likely culprit is messy code. The most basic help is to gently convince them that a little more discipline goes a long way. Clean it up first, then debug it. Generally this is more of a problem for younger, less experienced developers and for high-stress environments.

Sometimes the problems come from trying to build something sophisticated without fully understanding the underlying model. We often have to jump into development years before we fully grasp the problem, but once the code is set -- even if it is obviously broken -- it becomes hard for most developers to admit their initial approach was flawed. They would rather struggle through with bad code than toss it and start over again. In a complex problem, if you don't have a simple clean abstraction, any amount of artificial structural fiddling will obscure the base structure, making it very hard to fix. Most developers will err on the side of making their implementations more complex than necessary. All of the extra 'if' and 'for' statements cloud the code and make it unstable. It is like stuffing hay into a sack with lots of holes; as you stuff it in one side, it falls out the other. It never ends. Refactoring could work, but only if it goes deep enough to really change the model. A clean rewrite of only the 'core' code is usually the best answer.

When sloppiness or overcomplexity aren't the main problems, overloading variables with several independent values is also very common. Splitting the information into two or more separate variables generally leads to a working solution. It's a relatively painless refactoring, which should be a standard skill for all programmers because overloading the meaning of their variables is such a common problem.
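A hypothetical sketch of that refactoring: below, a single 'status' value was carrying both an error code and a progress percentage, distinguished by sign. Splitting it into two fields removes the ambiguity entirely (the names are mine, not from any real system).

```python
# Before: one overloaded variable.
#   status = -2   # connection error? or somehow 2% done? depends on sign
#
# After: each independent piece of information gets its own variable.

class TransferState:
    def __init__(self):
        self.error_code = None   # None until an error actually occurs
        self.progress = 0        # always a percentage, 0..100

    def fail(self, code):
        self.error_code = code

    def advance(self, percent):
        self.progress = percent
```

Every 'if status < 0' test in the old code becomes an unambiguous 'if state.error_code is not None', and the two values can no longer clobber each other.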

For any of these programming issues, the best assistance comes from getting the other programmer to use you as a sounding board for their code. As they try to explain how it works to you, they often see the error of their ways and how they should correct it. Expressing the solution verbally is a powerful tool. That, consistency, debugging and refactoring are the absolutely required skills for any good developer.

Looking further outwards at the bigger picture, we draw 'lines' in the code by breaking off the dependencies between the pieces; separating them into independent sections of code. If there is a clearly defined 'interface' that one piece uses to access the other piece, then there is a line between them. A system with few or no lines at all is a huge mess, usually referred to as a big ball of mud.
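As a sketch of what such a line looks like in code (all of the names here are invented for illustration), the interface is the only thing the two sides share; the caller never learns where the data actually comes from.

```python
from abc import ABC, abstractmethod

class TitleSource(ABC):
    """The 'line': callers depend only on this interface."""
    @abstractmethod
    def get_title(self):
        ...

class StaticTitleSource(TitleSource):
    """One side of the line: a specific implementation."""
    def get_title(self):
        return "The Programmer's Paradox"

def print_banner(source):
    # The other side of the line: this code has no idea whether the
    # title came from a constant, a file or a database.
    return "== %s ==" % source.get_title()
```

Swapping StaticTitleSource for a file-backed or database-backed implementation never touches print_banner; that independence is exactly what the line buys you.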

Balls of mud are a common reason that most large corporate in-house systems fail to work properly. 250,000 lines of code with no structure is a nightmare waiting to be deleted. Not to worry, one can add architecture after-the-fact, although it is more work than getting it right initially. The general idea is to draw enough lines in the system, partitioning the code into horizontal and vertical pieces. Breaking the program down into processes, threads, modules, libraries, components, sections, layers, levels, etc. draws important features into the architecture. The more pieces you have -- if they are not too small, and encapsulated within reason -- the better the architecture. When you can explain to another developer how the system works in five minutes with a couple of whiteboard diagrams, then you've probably drawn enough lines into the code to be useful.

Another reason to draw lines in your code is to swap out bad algorithms. If you know the code is doing something poorly, refactor it so that the whole algorithm is contained in a single component or layer. Once you've isolated it, you can upgrade to a newer version with a better algorithm. The work involved is boring, but it is not hard to do. You do need to pick your targets selectively or you'll spend too much time refactoring code without enough effect. As an example, switching from coarse-grained locking to fine-grained locking in a web system means burying all of the lock calls behind an API, harmonizing the API arguments, and then swapping in a newer locking version. All of the related code still takes and releases locks; what happens underneath is encapsulated away from the caller.
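Here is a hedged sketch of that locking example: the acquire/release calls go through a small API, so the coarse- versus fine-grained choice underneath can be swapped without touching any of the callers. LockManager and its internals are hypothetical, not any particular library.

```python
import threading

class LockManager:
    """All lock calls go through this API; callers never see whether
    the locking underneath is coarse or fine grained."""

    def __init__(self, fine_grained=False):
        self._fine = fine_grained
        self._global = threading.Lock()
        self._per_key = {}

    def _lock_for(self, key):
        if not self._fine:
            return self._global  # coarse: one lock guards everything
        # fine: one lock per key, created on first use
        return self._per_key.setdefault(key, threading.Lock())

    def acquire(self, key):
        self._lock_for(key).acquire()

    def release(self, key):
        self._lock_for(key).release()
```

Callers always write acquire("orders") and release("orders"); upgrading the system from the coarse strategy to the fine one is a one-line change in the constructor, not a sweep through the code base.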

As a complete product or tool, most systems are seriously in need of consistency and good graphic design. Consistency you can do yourself, and it's always worth the effort. Your users will definitely notice, and if you have contact with them, they will usually thank you. Hiring a graphic designer and an editor always pays for itself in terms of the final appearance of the system. Too many software developers think their own poorly designed systems actually look good; users don't complain because it is not worth the effort and they doubt it will ever get fixed; so why waste the time.

If you've gotten this far in changing your working environment, the system is pretty good, so we can move on to bigger issues. This time we go to the users.

We build tools for our users. Our overall direction of development is to build better tools. This means increasing your 'real' knowledge of the problem domain. Programmers always think they know more than they actually do, so being wary of this will lead you to start spending more time with the end-users to understand what they are actually trying to accomplish with their tools. Keep in mind that we may understand the type and frequency of the data, but that doesn't mean we really understand the problem. Complaints are especially important because they highlight some part of the system that needs to be fixed. Rarely are the users actually wrong; it is the system that is stiff and hard to use. It is likely that you could find some tool that better fits the users' needs if you search hard enough for it. Just don't fall into the trap of completely trusting the users' own analysis of their needs. They know what they want to do -- they are the experts on that -- but they really don't know which tools would best suit them; that should be your expertise.

If we come to understand how to build our system for the user, the next step is to branch out into creating more tools that solve their problems. There is no end to how much advanced functionality can be packed into an underlying system without actually degrading it. Too often professional products 'arc' in their features, reaching a point where successive releases are actually worse than their predecessors. It is a very complex problem to combine several tools together under the same 'roof' in a way that actually improves the functionality; just smashing them together is weak. Given that so many of our tools are stuck in silos, with poor ways to interact with each other, this is a huge problem in the industry that needs to be solved. Too much time is wasted importing and exporting data through various different silos, when the computer as a tool is powerful enough to handle all of this for us transparently.

Building your own code is fun when you start, but once you've been developing for a while, the real challenge comes from trying to direct larger and larger groups to build bigger systems. Bigger tools demand bigger teams. Three guys building a system is a very different organizational problem than trying to do it with two hundred. Honestly, I know of no large shops of programmers that actually produce great or even good works. Small teams craft the initial versions, then large teams slowly grind the code base into uselessness. Even if they have good people, a large team bends towards fugly code. We don't currently have any methodologies or processes to get around this; it is another huge open problem in software development.

If there are many development efforts on-going, then trying to change them from giant brute-force exercises in re-writing the same code hundreds of times into a leverageable abstract foundation for building a large series of complex tools is both a technical problem and a people-management problem. Technological issues are fun because they are isolated, but the really challenging problems come from mixing humans and technology, whether it is running the system or just building it. Master this and you've really mastered software development; just knowing how to whack out a couple of 'for' loops doesn't count.

That's enough of a general direction to get anyone from a simple development project into a grand plan to re-develop all of the company's software projects. As you can see, there are an awful lot of simple things that we can do to affect our environment and make it better. Patience is necessary, but we can all do it. You can make enough small changes so that work is interesting, challenging, but not frustrating. Although it seems contradictory to what many of those evil management consultants are implying, work can be fun, rewarding, and you'll actually get more done for your company. A positive workplace is always more effective.

Thursday, December 6, 2007

Pedantically Speaking

Spending your days programming a computer -- if you do it long enough -- will start to have a noticeable effect on your personality. Not that it is a big surprise; one's profession -- no matter how complex or simple -- has always gradually morphed one's personality. If you think that all of the accountants you've met are basically the same, you're not that far off base.

Programming is a discipline where we bury ourselves in the tiniest of details. Perfection is necessary, at least to the degree that your work won't run on the computer if it is too sloppy, and it is exceptionally difficult to manage if it is even a bit sloppy. This drives in us the need to be pedantic. We get so caught up in the littlest of things that we have trouble seeing the whole.

Most often this is a negative, but for this post I wanted to let it loose into its full glory. It is my hope that from the tiniest of things, we may be able to draw the greatest of conclusions.

Originally, the thoughts for this blog entry were a part of a comment I was going to make for another entry, but on reflection I realized that there was way too much of it to fit into a simple comment field. This post may be a bit long winded with lots of detail, but it is the only real way to present this odd bit of knowledge. You get some kinda award if you make it to the end in one piece.

I was going to be rigorous in my use of a programming language for any examples. Then I thought I might use pseudo code instead. But after I had decided on the opening paragraph, I figured that the last thing I actually needed was to be more rigorous. If I am going to be pedantic then the examples should be wantonly sloppy. Hopefully it all balances out. The best I can do for that is to mix and match my favorite idioms from various different programming languages. The examples might be messy -- they certainly won't run -- but hopefully beyond their usefulness there should be minimal extra artificial complexity. Straight to the point, and little else.

For me the classic example of encapsulation is the following function:

string getBookTitle() {
    return "The Programmer's Paradox";
}


This is an example of cutting out a part of the code into a function that acts as an indirect reference to the title of a book. The function getBookTitle is the only thing another programmer needs to add into their code in order to access its contents. Inside, encapsulated away from the rest of the world is a specific instance of 'book title' information that references a specific title of a book. It may also happen to be the title of a blog, but that is rather incidental to the information itself. What it has in common with other pieces of similar information out there is possibly interesting, but not relevant.

The title in this case both explains the underlying action for the function -- based on the verb: get -- and expresses a categorization of the encapsulated information: a book title. It may have been more generalized, as in getTitle, or less generalized, as in getMyBookTitle, but that is more about the degree of abstraction we are using than it is about the way we are encapsulating the information.

In a library, if the code were compiled and the source wasn't available, the only way other programmers could access this information is by directly running this function. This 'information hiding', while important, is just one attribute of encapsulation. In this case my book title is hidden, but it is also abstracted to one degree as a generic book title. Used throughout a system, there is no reference to the author or any necessity to know about the author.

Now this "slice and dice" of the code is successful if and only if all of the other locations in the code access the getBookTitle function, not the actual string itself. If all of the programmers agree on using this function, and they are consistent in applying that agreement, then we get a special attribute from this. If you consider that this is more work than having the programmers just type in the string each time they needed it, then that extra work should be offset by getting something special in return. In this case, the actual title string itself exists in one and only one place in the code, so it is a) easy to look up, b) easy to change consistently for the whole program and c) easy to verify correctness visually. All strong points that make this more than just a novel approach to coding.

So far, so simple. Now consider the following modification to the function:

string getBookTitle(string override) {
    if(override is NULL)
        return "The Programmer's Paradox";
    else
        return override;
}


This is a classic example of what I often refer to as partial encapsulation. While the base function is identical, the programmer has given the option to any other programmers to override it with their own individual title. Not a realistic example, you say? Well, if you consider how many libraries out there allow you to override core pieces of their information, you'll see that this little piece of code is actually a very common pattern in our software. In a little example like this, it seems silly to allow an override, but it is done so often that it always surprises me that people can't see that what they are doing is really isomorphic to this example. It happens frequently.

The beauty of encapsulation is that we can assume that if all is good in coding-land and everyone followed the rules, we can easily change the book title to something else. If we rebuild, then the impact of that change will be obvious and consistent. More great attributes of the work we did. The problem with partially encapsulating it, however, is that these great attributes are no longer true. We no longer know where in the code programmers used the original string and where they chose to override it with one of their own. While we could use a search program like grep to list out all of the instances of the call and check each one manually, that is a lot of extra time required in order to fully understand the impact of a change to the code. At 4am, that makes partially encapsulating something a huge pain in the ass. We lose the certainty of knowing the impact of the change.

With the exception of still having the ability to at least search for all of the calls to the getBookTitle function, the partially encapsulated version is barely better than just having all of the programmers brute-force the code by typing in the same string each time, over and over again. There is still one advantage left, but consider how many great attributes we lost by opening up the call. If it was just to be nice to the calling programmers and give them a 'special' feature, then the costs do not justify the work.

We can move on to the next iteration of the function:

string getBookTitle() {
    file = openFile("BookInformation.cfg");
    hash = readPropertiesFromFile(file);
    closeFile(file);
    return hash{"BookTitle"};
}


Now in this example, we've moved our key information out of the code into a location that is far more configurable. In a file, it is easy for programmers, system administrators and sometimes the users themselves to access and change the data. I ignored any issues of missing files or corrupt data; they aren't relevant to this example, other than to say that the caller high up the stack is concerned about them. The BookTitle functionality is really only concerned with returning the right answer. Presumably the processing stops if it is unavailable.

When we push the information out to a file, we open up a new interface for potential users. This includes anyone who can or will access the data. A configuration file is an interface in the same way that a GUI is one. It has all of the same problems, requiring us to check all of the incoming data in a bulletproof manner to make sure we are protecting ourselves from garbage.

More to the point, we've traded away our BookTitle information, which we've encapsulated into the file definition in exchange for the location of the file itself and the ability to correctly parse it. We've extended our core information.

In the outside world, the other programmers don't know about the file and if we are properly encapsulating this information, they never will. If we open up the ability to pass in an overriding config file, we are back to partially encapsulating the solution, but this time it is the config file access info itself, not the original title info that has been opened up.

Now the BookTitle lives in an accessible file; even if we set it to read-only, it is hard to actually determine whether it was edited or not, so in essence the value is no longer encapsulated at all. It should be considered to be publicly available. But even though its actual value can no longer be explicitly controlled, its usage in the system is still fully encapsulated. The information is no longer hidden, but most of the encapsulation properties still hold. We have the freedom to handle any value, but keep the encapsulation of how it is used within the system. The best of both worlds.

Now we could consider what happens when we are no longer looking at a single piece of information, for example:

(title, font, author, date) = getBookInformation();

In this case, the one function is the source of multiple pieces of related information for a book. This has the property of 'clumping' together several important pieces of information into the same functionality, which in itself tends towards ensuring that the information is all used together and consistently. If we open it up:

(title, font, author, date) = getBookInformation(book_identification);

Then a programmer could combine the output of multiple different calls together and get them mixed up, but given that that would take extra work, it is unlikely to end up that way in production code. Sometimes the language can enforce good behavior, but sometimes all you can do is lead the programmer in the right direction and make it hard, but not impossible, for them to be bad.

In this last case, the passed in key acts as a unique way of identifying each and every book. It is itself an indirect reference in the same way that the function name was in the very first example. Assuming that no real information was 'overloaded' into the definition of the key itself, then the last example is still just as encapsulated as the very first example in this post.
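Continuing the running example as a concrete, hypothetical sketch: the opaque key simply indexes into some data source, and nothing about the book leaks into the key itself, so the caller remains as encapsulated as with the zero-argument version.

```python
# A hypothetical data source, keyed by an opaque identifier.  The key
# "bk-001" encodes nothing about the book; it is purely a reference.
BOOKS = {
    "bk-001": ("The Programmer's Paradox", "Garamond", "Paul", "2007"),
}

def get_book_information(book_id):
    """Return (title, font, author, date) for an opaque key.

    Because the key carries no overloaded meaning, callers learn
    nothing about the book except what this function hands back.
    """
    return BOOKS[book_id]
```

Replacing the dictionary with a file or a database table changes nothing for the callers; they still pass an opaque key in and unpack the same tuple out.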

Passing around a key does no real harm. Particularly if it is opaque. If the origin of the key comes from navigating directly around some large data source, then the program itself never knows either the key or the book information.

It turns out that sticking information in a relational database is identical to sticking it in a file. If we get the key from a database, along with the rest of the information then it is all encapsulated in the calling program. There are still multiple ways to manipulate it in the database, but if at 4am we see the value in the database we should be safe to fully assume that wherever that key is used in the code it is that specific value, and that property is also true of all other values. We have enough encapsulation to know that the value of the data itself is not significant.

That is, if I see a funny value on the screen then it should match what is in the database, if not then I can make some correct assumptions about how the code disrupted it along the way.

There are many more examples like these, but to some degree or another they all come down to encapsulation or partial encapsulation. We just need to look at whether or not the information is accessible in some manner and how much of it is actually accessible. It all comes down to whether we are making our jobs harder or easier.

So far this has been fairly long piece, so I'll need to wrap it up for now. Going back over it slightly, we slice and dice our code to make it easier to build, fix and extend. Encapsulation is a way in which we can take some of the information in the system and put it out of harm's reach so that our development problems are simpler and easier to manage. Partial encapsulation undoes most of that effort, making our work far less productive. We seek to add as many positive attributes to our code as possible so that we can achieve high quality without having to strain at the effort.

We can construct programs that have little knowledge of the underlying data beyond its elementary structure. For that little extra in complexity, we can add many other great attributes to the program, making it more dynamic and encapsulating the underlying information. The best code possible is always extremely readable, simple and easy to fix. This elegance is not accidental; it must be built into the code line by line. With the correct focus and a bit of practice, programmers can easily build less complex and more robust systems.

PS. I was only kidding about the reward; all you'll get from me is knowledge. Try the shopping channel for a real reward...