Sunday, November 29, 2009

Spaghetti Code

And now back to our regularly scheduled programming :-)

I think I've managed to get past my prime fetish for a while (although I'm still intrigued by Goldbach's conjecture). Fun stuff, but clearly not good for one's popularity. At least somebody said my diagrams were pretty ...

Continuing on from my earlier post:

http://theprogrammersparadox.blogspot.com/2009/10/ok-technically-im-under-influence.html

My second set of laws was:

Fundamental Laws of Spaghetti Code:

1. You can't understand it just by looking at it.
2. Repetitive code will always fall out of synchronization.
3. The work will always get harder and slower as it progresses.
4. Fixing one bug creates several more.
5. You can't enhance the code if you are spending all of your time fixing the bugs.

But first, some comments from the earlier post. Astrobe commented:
Spaghetti code:
I think that characterizing "bad code" is interesting only if one says by which process the original code "degrades" or "rusts". For instance:
"backwards compatibility is a poison for software"

Hans-Eric Grönlund also commented (on a later law, but still relevant here):
"2. Well-structured software is easier to build."

While I can't argue with the above "law" it makes me think. If well-structured software really is easier to build then why isn't everybody building it so? Why is so much of our code, despite good intentions, ending up as "bad design"?
Could it be that well-structured software is easy to build but requires good design, which is where the experienced difficulty lies?

I figured for this post I could dig deeply into my notions of "spaghetti code". All in all it is actually a controversial issue, since one programmer's spaghetti approach can be another's set of well-applied principles. And vice versa.


JUST A NUMBER

Back in school, in one of our computer theory courses, we were taught this useful concept of assigning a "number" to various different programs. In the days of statically compiled code that was pretty easy to accomplish because the number was just the bits of the code itself.

We can interpret all of the bits in an executable as one giant string representing a fairly huge number. A massive number, but still just a number.
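To make the "just a number" idea concrete, here is a minimal Python sketch. The four bytes are a hypothetical stand-in for a real executable, which would simply be a much longer byte string.

```python
# Hypothetical stand-in for the raw bytes of a compiled executable.
program_bytes = bytes([0x7F, 0x45, 0x4C, 0x46])

# Read the whole byte string as one big-endian unsigned integer.
program_number = int.from_bytes(program_bytes, byteorder="big")

# The mapping is reversible: given the length, the number turns back
# into exactly the same bytes.
restored = program_number.to_bytes(len(program_bytes), byteorder="big")
```

The only design choice here is the byte order; any fixed convention works, as long as the same one is used in both directions.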

Now, if we take the uniqueness of a program to mean that it handles a specific set of inputs by producing a specific set of outputs, and that two programs are identical if the inputs and outputs are the same, this leaves us with what are essentially "groups" of programs that perform nearly identically to each other.

From the outside, the programs all do the same thing, but they do not necessarily execute the same statements in the same order to get there. Nor do they use the same amount of resources (CPU, memory, etc.).

Simplifying this a bit, for any set of user functionality there are an infinitely large number of programs that can satisfy it. Infinitely large? Yes, because there is always code you can add to a program that doesn't change the results, but does force the program itself to be larger (unless your compiler is super-intelligent) and more complex.

For every possible program, we can construct an infinite number of larger programs that do exactly the same thing, but use up more space, memory, disk, etc. without providing any extra functionality.
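As a rough illustration, assuming a made-up summing routine, the two Python functions below belong to the same "group": they return the same result for every input, but the second one carries extra instructions that contribute nothing. The names `total` and `total_padded` are hypothetical.

```python
def total(values):
    """Add up a list of numbers."""
    result = 0
    for value in values:
        result += value
    return result


def total_padded(values):
    """Same inputs and outputs as total(), plus work that changes nothing."""
    result = 0
    scratch = []               # never used for the result
    for value in values:
        scratch.append(value)  # bookkeeping that is thrown away
        result += value
        result += 0            # no-op arithmetic
    scratch.clear()
    return result
```

More padding of this kind can be added without limit, which is where the infinite number of larger, equivalent programs comes from.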

Within this group of programs, some of the larger numbers are clearly going to be obfuscated for absolutely no reason. They may contain more instructions, but the instructions contribute nothing towards the overall functionality. There are an infinitely large number of these variations.

Interestingly, something similar is probably true about some of the smallest versions (numbers) of the code as well. Way back I wrote about simplification:

http://theprogrammersparadox.blogspot.com/2007/12/nature-of-simple.html

Our human perspective of simple is not the same as the computer's. We don't actually prefer the "simplest" version at any specific level, but rather something that has been well-balanced as it has been reduced.

For example, simple could come from the code being contained in one great giant function or method, instead of being broken down into smaller well-managed calls. It's simpler with respect to the overall size of the code (and the run-time), but certainly not decomposed into something more manageable for a human to understand.

Because of that, for many of these groups of programs, the smallest numbers are likely simplified in ways that are not easy for us to understand. They're not desirable versions.

So, along this range of numbers, representing programs, we find that if we were looking for the best, most reasonable, version of our program, it would be located somewhere towards the middle. The smaller code lacks enough structure, while the bigger code is probably arbitrarily over-complicated; just a tangle of spaghetti. We need something balanced between both sides.


MIDDLE OF THE ROAD

If we are interested in a reasonable definition for spaghetti code, it would have to be any code where the internal structure is way more complex than necessary: where the complexity of the code exceeds the complexity required. Most of the problems lie at the structural level.

Almost by definition you can't mess with the individual lines themselves. Well, that's not entirely true. You can use something like a pre-processor to put a complex second layer over the primary language one.

You can also use the language features themselves to make any conditions grossly complex, or use some crazy number of pointer de-references, or any other syntax that ultimately amounts to the programmer being able to "hide" something important in the line of code. It's not the syntax itself, but what it obscures from the processing that matters.

Now, in general, hiding things isn't necessarily a bad thing.

We do want to "encapsulate away" most of the complexity into nice little sealed areas where we can manage it specifically and not allow it to leak over into other parts of the program.

However, the success of what is being done is related absolutely to what we are trying to hide in the code. That is, if you hide the "right stuff" (coming up from a lower level), then the code appears far more readable. If you hide the "wrong stuff", then it becomes impossible to tell exactly what is going on with the code.

The different "levels" in the code need to be hidden from above, but they should be perfectly clear when you are looking directly at them. Everything in the code should be obvious and look like it belongs, exactly where it ended up.

We don't want it to be a giant tangled mess of instructions. To avoid being spaghetti code, there must be some reasonable, consistent structure overlaid, that does more than just add extra complexity.

The last thing programs need is unnecessary complexity.

Design patterns are one technique that is often abused in this manner. I saw a post a while ago about how one should replace most switch statements with the strategy pattern.

It's a classic mistake, since by adding in this new level of complexity, you shift the program into one of the higher numbers in its range. But there have been no significant gains in any other attribute (performance, readability, etc) in order to justify the change. If no useful attributes come out of the increase, then the code has passed over into spaghetti territory. It is arbitrarily more complex for no real reason.

The motivation for using the strategy pattern in this manner is simply to have a program that is deemed (by pattern enthusiasts) to be more correct. The reality is just more complexity with no real gain.
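A small Python sketch of the trade-off, with an if/elif chain standing in for a switch statement. The shipping example and all of its names and rates are hypothetical; it is only meant to show how much machinery the pattern adds when the branching is this simple.

```python
from abc import ABC, abstractmethod


def shipping_cost(method, weight_kg):
    """Branching version: every case is visible in one place."""
    if method == "ground":
        return 5.0 + 1.0 * weight_kg
    elif method == "air":
        return 10.0 + 2.5 * weight_kg
    raise ValueError(f"unknown method: {method}")


class ShippingStrategy(ABC):
    """Strategy version: one class per case, plus a lookup table."""

    @abstractmethod
    def cost(self, weight_kg):
        ...


class Ground(ShippingStrategy):
    def cost(self, weight_kg):
        return 5.0 + 1.0 * weight_kg


class Air(ShippingStrategy):
    def cost(self, weight_kg):
        return 10.0 + 2.5 * weight_kg


STRATEGIES = {"ground": Ground(), "air": Air()}


def shipping_cost_strategy(method, weight_kg):
    """Same inputs and outputs as shipping_cost(), via the strategy objects."""
    try:
        return STRATEGIES[method].cost(weight_kg)
    except KeyError:
        raise ValueError(f"unknown method: {method}") from None
```

Both functions produce the same answers. In this sketch the second one needs three classes and a dictionary to do what two branches already did, so unless those classes buy something else, it is just a higher number in the same range.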

It's a negative effort, but oddly a common approach to programming. It turns out that one of the seemingly common characteristics of many programmers is their love of complexity.

We're a bunch of people that are always fascinated by small, complex, inter-working machinery, so it is no wonder that the second greatest problem in most people's code is over-complexity (the first is always inconsistency, the third encapsulation).

Most programs contain far more machinery than is actually necessary to solve the problem. Many contain far more machinery than will actually ever be used at any point in the future to solve the problems.

From all of this we can easily understand that over-complexity essentially equals spaghetti code. They are the same thing. And that spaghetti code makes it hard to verify that the code works, find bugs and extend future versions.

All in all, spaghetti code is a bad thing.


FUNDAMENTAL LAWS

Now getting back to the (possible) fundamental laws, the first one was that:

1. You can't understand it just by looking at it.

I'm not sure that's an entirely correct statement. It's more like: if you're skilled and you should be able to understand it, but you don't, then it is probably over-complicated.

Some code is just inherently ugly and hard to understand by any non-practitioners. But once understood by an experienced professional it can be managed.

The classic example is always the APL programming language. The language is a brutal collection of algebraic symbols, where the language experts use any vector processing trick in the book to avoid doing things like simple loops. If they can find some strange set of operators to accomplish their string processing, for example, they're considered to be gurus in the language. If they start to use if/for like C processing logic, then they're just using the language wrong.

But it's not just complex syntax like APL. Any non-programmer would have a tough time understanding anything other than the most trivial C or Java. And they certainly wouldn't catch all the little ugly inherent subtleties in the language (like using StringBuffers and appends in Java).

If you give some C code to an experienced C programmer, while he (or she) might not understand the point or the domain aspects of the code, they certainly should be able to understand what the code is doing. And with some investigation (at the various levels in the code) they should be able to take that code and alter it in 'predictable' ways.

Even more to the point, in most languages, we have enough flexibility to encapsulate the different aspects of the logic so that virtually all of the code written in the system should look like it belongs as an example in a text-book. It should always be presented in a way that it looks like it is just a simple example of "how to do something".

Of course, anyone with experience on big systems knows that it is never the case. Sadly most of the code out there is horrifically un-readable. It's not anywhere close to being able to be used as an example. It shouldn't even see the light of day.

With good conventions and some normalization there is no practical reason why it should be such a mess; still it ends up that way.

It's probably that our industry doesn't insist on reasonable code, and thus the code is ugly and messy. That allows horrible problems to hide gracefully, which ultimately makes it way harder to work with the code.

It's the incredibly stupid, but most common way, that groups of programmers shoot themselves in the foot.


THE PERILS OF DUPLICATION

2. Repetitive code will always fall out of synchronization.

So, by now we have all of these great acronyms like KISS and DRY, but that doesn't stop us from encoding the same thing, in multiple ways, in multiple files, over and over again. When we do this, it increases the probability that some future set of changes will cause a bug because two different "sources" of the same information are no longer the same.

You gotta love it when someone sticks the "constants" in the code AND in the database AND in the config files. It's just an accident waiting to happen, and eventually it does.

If you consider the sum total of all of the work (code, scripts, installations, documentation, packaging, etc.), then there are a massive number of places where a relatively small amount of key information is replicated over and over again.

If it is fundamentally the same thing (even in a slightly different form) and it is contained in multiple places, and the programmers don't have the underlying discipline in process to make sure it is all updated, then at some point it will go "kablooey".

You can always trace a large number of bugs back to duplicated information. It's a popular problem.
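One way to keep the "constants" case from drifting, sketched in Python under the assumption that the config file and the database rows can both be generated rather than hand-edited. The setting names, values and output formats are hypothetical.

```python
# The single place where the values are defined (illustrative values only).
LIMITS = {"max_retries": 3, "timeout_seconds": 30}


def as_config_text(limits):
    """Render the values as key=value lines for a config file."""
    return "\n".join(f"{key}={value}" for key, value in sorted(limits.items()))


def as_sql_rows(limits):
    """Render the same values as (name, value) rows for a settings table."""
    return [(key, str(value)) for key, value in sorted(limits.items())]
```

Because both outputs are derived from `LIMITS`, a change made there shows up in every copy the next time they are generated, instead of depending on someone remembering to update three places.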


SPEED KILLS

3. The work will always get harder and slower as it progresses.

If you've got a mess, then everything you're trying to do with it either makes it worse, or it takes way longer than it should. Messes are the tar pits in which most programmers work daily.

What's really funny is how many excuses people will make for not cleaning up the code.

The favorite is invoking the term "history", as if that in itself is a reasonable justification for not cleaning up.

Another horrible one, mentioned earlier by Astrobe is "backwards-compatibility". A freakin huge amount of our unnecessary complexity has come from bad choices percolating upwards, year after year, under this banner.

Software vendors claim that their lazy-onion approaches to hacking code come from the market's need to support the older versions, but realistically there have always been "points" where it would have been acceptable to fix the earlier problems and remove the backwards compatibility.

For example, nothing has chained OSes to their own stinking tar pits more cataclysmically than their own marketing groups claiming to be backward compatible.

Apple is one of the few companies that ever really turfed their old weak and crippled OS (versions 1 to 9), to jump to a hugely different one with ten (OS X) and above.

A gamble, for sure, but proof that it can be done. Since for most OS versions you inevitably have to upgrade most software packages anyways, the whole notion of backwards compatibility being important is more myth than good sense.

If the programmers don't fix the rapidly multiplying problems with inconsistent and bad structure, then how does one expect them to go away? Why wouldn't they make it more expensive to dance around? You can't bury your head in the sand and expect that the coding problems will just disappear. It won't happen.


CASCADING TIME

4. Fixing one bug creates several more.

The best description for this is stuffing straw into a sack full of holes. As you push it in one side, it pops out the others. It's impossible to finish. A hopeless task.

Anytime a group of programmers generates more bugs than they fix, it is either because they don't have the necessary experience to understand and fix the code, or because the code itself is hopelessly scrambled.

In the first case, there is a significant number of programmers out there that will leap off quickly and start working on code that they do not have the pre-requisite knowledge to understand. It is endemic in our industry to think we know more than we do, to think that we are smarter than we actually are (and user comments don't help).

Still, I've seen enough programmers try to tackle issues like parsers or resource management or even multi-threaded programming without having anything close to enough understanding.

It probably comes from the fact that within the domain itself, we are certainly not experts, but also not even novices. We get a perspective of what our users are trying to accomplish, but particularly for complex domains, we could never use the software ourselves to meet the same goals.

I guess if we learn to become comfortable with our weak domain knowledge, that transfers over to our approach towards our own technical knowledge.

The crux, then, is that a group of programmers may be expending great deals of time and effort trying to fix some multi-threaded code, but to no real great effect. They just move the bugs around the system.

In truth, the programmer's lack of knowledge is the source of the problem, but even as they acquire more understanding, the code itself is probably not helping.

And that comes back to the second case.

If the code is structurally a mess, then it is very hard to determine, with only a small amount of effort, whether or not the code will do as advertised.

If the code is refactored, and cleaned up, then its functioning will be considerably more obvious. And if it is obvious, even if the programmer is not fully versed on all of the theoretical knowledge, they have a much better chance of noticing how to convince the code to do what the programmer desires.

Bugs, and how they relate to each other, become strong indicators of development problems.


TIME MANAGEMENT

5. You can't enhance the code if you are spending all of your time fixing the bugs.

The single biggest issue with code is time.

The time to write it, the time to debug it, the time to document it, the time to package it, the time to distribute it, and the time to correctly support it. Time, time, and less time drive all programming issues.

And as such, it is always so hard to hear programmers making excuses about how they don't have enough time to fix something, when it's actually that something that is eating up their time in the first place.

Strangely, for a group of people who like detail, we tend not to notice the cause of our troubles too often.

Time is such a big problem, that I think that's why I always find it so disconcerting when I hear industry experts pushing processes that require large amounts of time, but don't return much in exchange. It is easy to make up a process to waste time, it is hard to get one to use it efficiently in a series of reasonable trade-offs.

But we seem to be suckers in wanting to find easy, and fun solutions, that don't respect the fact that we can't afford the luxury of taking our time.

Programming is at least an order of magnitude more expensive than most people are willing to pay for. It survives because the total final costs get hidden over time, and by a sleazy industry that knows that truth is certainly not the best way to sell things. Software sales and consulting teams often rival our worst notions of used-car salesmen.

So the real underlying fundamental question that every programmer has to always ask about every new process, and every new technology is simply: "does using this process or technology pay for itself in terms of time, or not?"

If it's not helping, then it is hurting.

One popular example these days is unit testing. Somehow, in some strange way, a bunch of people have taken a rather simple idea and sent it off on a crash course.

In the end, we should test most of the system in its most complete, most integrated state. Testing a partially built system, or just some of the pieces is not going to catch enough of the problems.

Testing is very limited, and based on guess-work, so it certainly is an area where we need to be cautious with our resources.

In the past I have built a couple of really great, fully automated regression test environments to ensure that the attached products had exceptionally high quality. Automated testing assures that, but it is expensive. It's basically a two for one deal. You write twice as much code, and that will give you a small bump in quality. Expensive, but if that is the priority on the project, it is a necessary cost.

Still, if the automated tests are being run against the full and final integrated system then they conclusively prove, for the coverage of the tests, that the system is working as expected. You can run them, and then release the system.

A unit test, on the other hand is a very low-level test.

In the past, I've used them for two reasons: one is when the permutations of the code make it unreasonable to do the test manually at the integration level, but the code is encapsulated enough to mean that the unit test is nearly identical with an integrated manual one.

That is, I wanted to save time by not having a person test it, and I wanted the test to really permute through all of the possible combinations (which a person is unlikely to do correctly). 
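A minimal sketch of that first kind of unit test in Python. The `can_edit` rule and its truth table are hypothetical; the point is only that `itertools.product` walks every combination of inputs, which a person testing by hand is unlikely to do without missing one.

```python
import itertools


def can_edit(is_owner, is_admin, is_locked):
    """Hypothetical rule under test: admins can always edit, owners only when unlocked."""
    return is_admin or (is_owner and not is_locked)


# Expected result for every (is_owner, is_admin, is_locked) combination.
EXPECTED = {
    (False, False, False): False,
    (False, False, True): False,
    (False, True, False): True,
    (False, True, True): True,
    (True, False, False): True,
    (True, False, True): False,
    (True, True, False): True,
    (True, True, True): True,
}


def check_all_combinations():
    """Run the rule against every input combination and return how many were checked."""
    checked = 0
    for combo in itertools.product([False, True], repeat=3):
        assert can_edit(*combo) == EXPECTED[combo], combo
        checked += 1
    return checked
```

With three boolean inputs there are only eight cases, but the same loop scales to input spaces that would be unreasonable to cover manually.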

The other times I've used unit tests are as scaffolding to work on some particularly algorithmically nasty code, that's been having a lot of trouble.

It's really just a temporary bit of scaffolding to make debugging easier.

Generally, once the code has stabilized (over a few releases) I nuke the tests because I don't want them lying around as possible work sink holes. They did their job, now it's time they get cleaned up. You put up scaffolding, with the intention of taking it down someday.

Unit tests are just more code in the system that is going to be wrong, need fixing, need updating, and take time. Time that I'd rather apply to areas that need more help. To areas where we can be more successful (like cleaning up the code).

If the choice is: a) write a unit test, or b) refactor a mess, then between the two, I've learned that the second one has a way higher probability of paying off.

But in our industry, the first one has become a newly popular process. As such, a lot of programmers have been attracted by it because they would rather write something new than actually fix what they have. The avoidance of work appeals directly to most people's weaknesses.

Since no one can see that the code is a mess, and it is much more fun to write some new (but not too complex) code instead of fixing the real problems, programmers choose to add new stuff. And the new stuff eats up more time than fixing the old stuff would have.

In the end, any poor use of time, gradually over the course of a development starts to show up badly. Our industry says that two thirds of all of our projects fail, which means that the majority of programmers don't actually have any real clue on how to build and release systems reliably.

My experience in working in many different organizations has shown that to be mostly true (although some large companies often support completely impractical projects, and that leaves the programmers thinking that their success rates are far higher than they actually are).

With this law, I might have been more direct and said something like: recovering time lost to maintaining spaghetti code should be the first priority of any development project. It's a sink for wasted time, and it shouldn't be.

Once we figured out how to refactor (mostly non-destructively), we had the tools available to get our code bases in order. It's a critically bad mistake to not fix code early on. The longer it lingers, the more odorous the onion gets, the more resources that get tied into testing, support, patching, covering up and apologizing for the code.


SUMMARY

We all know spaghetti code is bad, but it still shows up frequently in our systems. It is everywhere. In fact, one might guess that most of our code is grossly over-complicated in some easily fixable way. Misuse of techniques like Design Patterns only makes it worse.

Hans-Eric Grönlund was really close in his earlier comments. I do think it is actually easier for programmers to build over-complicated code than it is for them to build simple and elegant code. That accounts for the difference in volume between the two.

I think in general, it is way easier to be a bad programmer than a good one, and that most people choose the bad route (whether or not they know it).

Few humans are capable of hammering out minimally complex versions of something. It's not how we think. Even incredibly experienced gurus have to go back, again and again, to simplify their calls and make their work consistent. The first pass is always the brute force one. And the first pass is often an extremely over-complicated one.

We get bad code because, left on our own, we build bad code. And left again on our own, we'd rather do anything else than clean up our own mess. So the code stays around, and people just build worse things on top of it.

We are also an industry of extremists.

The Web always has someone, somewhere, taking some idea and pushing it way past its point of usefulness.

A good idea in moderation can become awesomely stupid when applied too pedantically. The machinery we build doesn't need to be complex, we don't want it to be complex, and complex isn't going to help.

So why do people keep dreaming up these crazy ideas that only obfuscate the underlying solutions? Why do programmers keep flocking to them?

Some of it, as I wrote earlier is about simplifying with respect to a different set of variables, but a lot of it is probably about ego.

Programmers have big egos, and they like to work on the hard projects. So, even if their task isn't particularly hard, they'll do everything in their power to make it as difficult and hard as possible. It's a common weakness.

In a similar manner, many people believe that writing big long complex sentences with lots of obscure terms makes them sound smarter. But the truth is (and always has been), that the smartest people are not the ones that take a simple problem and make it appear complicated. They are instead the ones that make all of their solutions look as simple and trivial as possible. Ultimately that's what makes them famous, makes them stand out from the crowd. Being able to obfuscate something isn't intelligence, it's just patience, and an inability to see the larger picture.

The same is true in the programming world. The systems that survive are the ones that do their job as simply and painlessly as possible (that's why main-frames are still popular, they're ugly but straight-forward). The really complex stuff generally gets replaced fairly often (or never used). But there are just too many people in the industry trying too hard to prove that they are as smart as they think they are. And as such, we are awash in an endless sea of bad code.