Saturday, April 25, 2015

Special Case Considered Harmful

Without a doubt, the all-time quickest way to implement any new variability in an existing code base is to create a new branch with if/else statements, copy & paste the original code into its new block, and then modify the copy. You don't have to spend any time thinking. It's quick. It's initially painless. And it is one of the most popular ways to create code.

As a result of this, we often see programs that consist of layer upon layer of nested branches; sometimes four levels deep, sometimes worse. Advanced programmers avoid this by hiding these special cases in a mass of methods or functions.

Although the code itself gets larger, the testing of conditionals is usually cheap so there aren't any real performance penalties associated with this type of code. It may not be optimal, but it's not really slow either.

Testing does require more effort; forgetting to check the functionality of a newly added special case can easily mean that it gets released accidentally. If the branches are deeply nested, it can take considerable effort to unwind all of them them into a complete list of test cases. Most often that's not actually done, leading to bugs.

Extending the code is where the worst hit is taken. A few branches are not hard to deal with, but as they quickly build up, it becomes harder and harder to guess their behaviour just by reading the code. With lots of test cases necessary to cover the whole range of behaviour, it also becomes harder to understand when using a debugger. Thus you can't read it and it's no easy task to walk it either.

As the difficulties mount, the uncertainty grows -- probably exponentially -- until any new change to the code has a much better chance of being wrong then it does of working properly. At this point, since the code can no longer be safely changed, other worse options are usually selected.

At the root of this very common problem are special cases. If there is only one way in the code to handle the all of the data, then everything goes through the same logic and there is no need for any branches. The code is almost trivial.

Still, most people find it easier to solve big problems by breaking off a subset and then handling it in a 'special' way.

It's taught to us offhandedly in that we learn to solve problems by decomposing them. That's fine, except when that decomposition cuts across deeply related problems. Breaking up everything into special cases might be an effective way to understand how to go about solving the pieces, but the second half of that approach -- that isn't taught -- is to bring it all together again for the whole solution.

Special cases are often highly favored in many domains, thus the domain experts and analysts are always a significant source of them entering into the project. Big systems get built by going after the low hanging fruit first, then picking off the lessor cases one by one. This insures that progress is visible right from on onset of the project.

Thus the proliferation of special cases is fast, easy and we're often led to believe that it is the correct way to handle complex problems. That this is then mapped directly into deeply nested code or a quagmire of functions is not really a surprise. But it doesn't have to be that way.

One of the most visible aspects when reading any elegant code is that it does not have either an excessive number of branches nor too many loops. The code is organized, the flow through it is clear and the special cases few or non-existent. Clarity comes from the level just above the instructions, done well it keeps the logic from spewing off into a myriad of spaghetti-like flows.

From those rare examples, we can easily conclude that minimizing branches is both good and definitely worthwhile if we are intent on having the code continue to expand. The follow up questions then are "how?" and "what's the minimum number of branches?"

Long ago, when I first started programming, there was always a lot of effort put into simplifying coding logic. The earlier languages offered little guidance in structure, so the programmers had to spent more time back then, trying to work out appropriate ways to keep it from getting out of control. Object Oriented programming, and indirectly design patterns, embedded some of that organization directly into the languages themselves, but at the expense of the newer generations of programmers no longer understanding why well-structured code was important. The emphasis shifted to gluing stuff together as fast as possible. That might work for consolidating clumps of existing technologies, but it's an unmanageable nightmare when applied to a sea of special cases coming in from the domain side.

What's necessary is to avoid implementing special cases wherever and whenever possible. And to cleanup some of the existing mess, it is critical to merge back any existing special cases into the mainstream. Fortunately special cases are easy to spot since they exist as branches, and they are easy to remove, you just get rid of the if/else statements and merge the code together.

This does require some thought and practice, but it is not rocket science and it becomes easier to do as you gain more experience.

In the really easy cases, you can drop a branch when the difference between the two underlying blocks of code is only a variable or two. That is, you create new variables, assign them values at the beginning and then perform the sequence of actions, passing in the variables where appropriate. In the worst case, you might have to go back aways through the code, adding back the variables right up until the data loaded. I'll get back to that approach later.

For some cases, the difference is actually contained in a larger set of variables. This is handled the same way as individual variables, but with a structure or object instead of a single variable. Keeping related variables together and building up complex structures to pass around is a strong approach to organizing any code.

To get rid of some cases, the lists of executed instructions are slightly different in each case. There are some subtle ways of handling this, but they usually involve either table-driven programming or an intermediate data-structure. The idea is that you build up data that lists the instructions, varying by each case, as you go. Once the structure is complete, you walk through it executing everything. Simple lists fit into statically created tables, while more complex logic can be dynamically created as lists, trees or even DAGs. You'll find this type of code at the heart of interpreters, compilers, editors, graphics programming and some really beautiful domain applications. The if statements are implicitly hidden because of the way the structure was constructed. This is a powerful way to eliminate special cases and to control complexity, while still making it almost trivial to extend the code. For static tables you just add in new rows, for intermediate representations you just add more functionality to manipulate the structure.

That of course leads to employing polymorphism in OO languages. The special cases disappear because everything is handled by overrides of basic primitive methods that are specific to each class. That is, the blocks in the if/else statements form the methods implemented in the objects. Then you get a 1:1 relationship between the smallest blocks needed and the primitives necessary to solve the problem. Set up correctly, expanding a polymorphic hierarchy requires as little thinking as creating deeply nested code (which generally means you have to be careful) but the end result is still organized. To go a bit farther, you can even reuse common code blocks via inheritance, which saves on typing and testing.

In particullary messy systems where there are an explosive number of permutations that all might need to be handled someday, polymorphism is a really strong approach. Even before OO we'd implement the same mechanics in C by using table-driven function pointers that implemented the primitives. Then to change processing for different sets of data or configuration all you needed to do was update the table with a new set of function pointers, thus making it easy and reliable to extend to new cases. Setting up that type of organization doesn't take much time, but it saves massive amounts later as the code grows.

Sometimes the special cases are very very hard to remove. For that we have to go up to abstraction, and then manufacture virtual elements to smooth out the logic. This is a tricky technique to explain in general terms but the basic idea is to add in things that doesn't really exist, but if they did then there wouldn't be any special cases remaining. A very simple example of it exists in the "The Art of Computer Programming" where Donald E. Knuth documents a couple of  algorithms for linked lists that always have a head node (circular and doubly linked). That is, the list is never empty, even if there is nothing in it. This simple trick reduces the logic down to something quite manageable and provides several neat little micro optimizations. It's a very small "space vs. time" trade-off that's unfortunately been mostly forgotten.

I've had to use this abstraction technique many times in the past to simplify difficult, yet performance dependent logical flows. The different cases are tantalizingly close to each other, but they don't actually fit together. It can take a bit of creativity and patience, but eventually you stumble onto some virtual piece that ties it all into one. The key is that the code actually gets simpler, if it doesn't then you need to toss it immediately and try again. Almost always it has taken a couple of iterations to get right, but in the long run it has always been worth it.

If you look carefully, you'll see this technique in all sorts of interesting places, but because it rounds out the code so nicely it is easy to miss. Still, if you really want to write something sophisticated, it is an indispensable part of the your toolkit.

Sometimes branches are just unavoidable, but that doesn't mean they have to gum up the code. Like limiting scope in variables, you can push a branch down to the absolute lowest point in the code. It wraps only the absolute minimum instructions that differ. Doing that is often enough to get it encapsulated away from the rest of the complexity. Another strong way to handle stubborn branches is to move them to the earliest part of the code where the data comes in. It has always been a good idea to only convert and verify data 'once' upon entry from an external source, thus avoiding needless fiddling and redundant checks. This is also a great place to add in variables that avoid branching. Placed there, the branch can set up a variable that flows cleanly through the rest of the main logic.

Most special cases are unnecessary. There are the very few rare ones that are completely unavoidable, but long before accepting any new special cases into that category all effort should be made force fit them into the mainstream properly. Special cases match the way people think and they get created quite naturally from various domains, programmers and the analysis process. That is to be expected, so it is up to the system's designers and its programmers to merge them back into as few cases as possible. Although it's easy to just whack out more special cases in a code base, it is one of the fastest and most dangerous forms of technical debt. Intrinsically each one is bump up in disorganization. It's also mindless, and in software anytime something isn't thought about explicitly, it usually manifests itself into a problem. We got into software development because we like to solve problems by thinking deeply about them, oddly trying not to think deeply about them is counter to what most people claim they enjoy.