Thursday, June 26, 2008

Readability

You don't have to hang around many large software development projects before you realize that most of them are crippled by their own code base.

The biggest symptom is how widely coding styles vary among the collaborators.

All programmers have different styles and preferences, but teams that are not trying to work together lead to heavily siloed code; code that is redundant and inconsistent. A bad situation that only gets worse with time.

The projects that work the best do so because (most of) the team is in harmony, going at it the same way, and reusing the same basic infrastructure. What I like to call "Several Species of Small Furry Animals Gathered Together in a Cave and Grooving with a Pict" after the well known Pink Floyd song.

While it is such a significant indicator of problems, and represents a long-term issue, it is not easy to get a group of programmers to work together in a consistent fashion. There are lots of reasons why that is true, but for this particular post I want to only focus in on one of them.

Readability, for those that like things to be clearly defined in the beginning, is the ease with which some unknown 'qualified' person can examine the code in question and make a reasonably correct judgment as to its function. The code is easily readable.

This is a key attribute in code, in that the most elegant answer to the problem is the simplest one that meets all of the necessary constraints, and is implemented in a readable manner.

That is, readability is the second half of making something elegant. It does you no good to have the perfect solution if its implementation obscures the fact that it is the perfect solution. The problem, in that case, is not fully solved.

All of that is rather simple in its own way, but as we dig deeper into trying to pin down the details we hit on a really nasty question: is readability subjective?


SUBJECTIVITY

If you are going to get a team of people to follow a standard, it is far more difficult if that standard is arbitrary. You lack convincing arguments to justify it.

Programmers, in particular, love to be unique, so without a good argument they tend to fall back onto their own styles, ones that suit their personality and cultural bias. Good for them, bad for the team.

This means that you need strong arguments to sway a team into following similar courses, and the argument "because I say so" is usually a bust with any group of even mildly contentious or cantankerous programmers. The strongest possible answer you could have is that there is a "provable" right answer which is not debatable.

At this point, I could dig into the essence of subjectivity and form some type of formal proof for saying that readability is or isn't subjective. But I won't. I leave the reasoning behind that for the very end of this post. You'll have to get through the other bits first before I can explain it.

Readability, in its essence, is a form of simplification. You can examine it on a line-by-line basis, but just because the individual lines themselves are at their most readable doesn't mean that the set of them together is particularly readable. This is one of those problems where we must consider the whole as well as the parts.

My instinct suggests that there is some element of subjectivity inherent in the definition of readability, since as a simplification there is no objective way to constrain all of the variables that may or may not make something readable.

More so, what is readable to one person is not necessarily to another, and standards, conventions, culture and style change the readability.

Even more interesting, the encompassing technology itself changes it, as one notices in the change in style brought on by tools like syntax-colored editors and larger screen real estate.

And of course, the implementation technology itself "bounds" the readability. Highly readable APL for example -- a horribly cryptic language of symbols operating on vectors and matrices -- is only "so" readable. COBOL on the other hand is overly so.

But, and this is a huge but, if you fall back and look at a number of textbooks for branches of mathematics like calculus, you will find a remarkable degree of similarity amongst the various proofs for theorems.

In that context, the proofs have to be rigorous, and the readers are always students, so the details need to be simple; these specifics drive the authors to find proofs that meet those criteria. These proofs are similar to each other.

In all ways, proofs are a type of code example, being nothing more than an algorithmic series of steps that must be followed to verify that something is rigorously true.

Those proofs then are isomorphic to computer language programs, which themselves are just sequences of steps that result in some change to a data-structure.

The degree of similarity between different versions of the same mathematical proof is important. The degree of similarity between proofs and code is also important.

This shows that while there may be some subjectivity, it is barely so. The differences between two proofs are most likely to be trivial or near-trivial. The same is true for "readable" code implementing the same algorithms in the same consistent manner.

Yes, it is subjective, but not enough to meaningfully affect the results.

Indirectly my favorite example of this comes from Raganwald, where Reg Braithwaite writes about narcissism in coding:

http://weblog.raganwald.com/2008/05/narcissism-of-small-code-differences.html

My particular view of that post is that the agnostic was focused on a simple reliable way of getting the job done. A good, simple readable answer.

The other three, the ascetic, the librarian and the purist, all took turns at applying their own "simplifications". The problem was that in their quest to simplify the code with respect to their individual biases, they introduced their own additional complexities. Unnecessary ones.

All three other pieces of code were consistent with their author's beliefs, but that was a trade-off made against the code's readability. And, in this specific case, since we don't know the rest of the system context, the trade-off is a poor one.

Worse, though, is the fact that the "team", as it were, is flip-flopping inconsistently between styles. Any of the overly righteous code might be obfuscated, but at least if it were consistent it wouldn't be that hideous. A mixed team is far worse than a purist one.


SIGNIFICANCE

Readability being negligibly subjective is a huge statement that, if inherently correct, has significant ramifications for software developers.

If you're writing an introductory book on programming, you spend a great deal of time fiddling with the examples to bring out their visual sense. The readers should be able to quickly look and see that the code is for a linked list and that it functions in such a manner.

Why would it be any different for writing a full system? Is it less likely that there will be other readers of the code? No, we know that if the code is good, lots of people will view and edit it.

Is this a one-shot deal, it is written and then cast in stone? No, the code will likely go through many iterations before it is even mildly stable.

Is it faster to write or execute ugly code? Definitely not, simple code usually performs better.

To some degree, in a simple programming example, because of the size, some of the detail has been dumbed down. But, in that way, while the detail should not be necessary for the example, it can also be abstracted away from the actual production code to leave the results more readable. Industrial strength code shouldn't be that far away from a good textbook example.

Sure, it involves some extra work, but that is a short-term effort that saves significant long-term trouble, and gets easier as you do it more often.

If you are serious about development, then you want to write elegant code. In all ways, that is what separates a professional software developer from just any old domain-specific expert who can type: they can write it, but it's ugly and fragile.

The high end of our skill is in not just getting it to work, but also getting it done elegantly, so that it continues to work for release after release, revision after revision. Doing it once is nice; being able to do it consistently is the great skill.

Then as a team, is it not obvious that coming together to build an elegant solution is a best practice? That said, no matter how hard you agonize about the code, you fall short if its implementation is just plain weird. And if you're not following your team-mates, then you're being just plain weird, aren't you? (weird is relative, after all)

The full strength of this argument is that someone -- anyone with a minor amount of technical experience -- should be able to look at the code and get a sense of what it is for. As such, if they look at it for hours and shake their head, it is more likely the code's fault than the reader's. And that, in the overall sense, is critically important to being successful with the project.

If the code's meaning shines clearly through the awkward syntax, it is good, but if not, you're doing it wrong.


THE BEST EXAMPLES

There are two simple properties to look for in really great code:

- the code "looks" simple.

- the code does "not" match what the code is actually doing.

The first one is obvious, but it is actually very hard to achieve. Some programmers get 'sections' of clean code, but to put forth an entire system that is clean, simple and obvious is a skill beyond most current practitioners.

Partly because they've never tried, but also partly because turning something simple into something hard is easy; going the other way is a rare talent.

You know you are looking at good code when it is easy to completely misgauge the amount of effort that went into it. If you figure you could just whack that out in a couple of days, then because it reads that simple, it is clearly readable.

The second property is harder for many programmers to understand.

If you have abstracted the nature of the problem and have thought long and hard about how to generalize it, then the implementation of the code will be at a higher level than the problem itself.

The upper implementation will be smaller, more configurable, more optimized, run faster, use less resources and be more easily debugged than having just belted out all of the instructions. A good abstraction is a huge improvement.

But it is a curve: go too far and you fall off the other side, and it becomes convoluted. If you don't go far enough, you have huge bloated code that is rigid and prone to bugs.

At that very peak of abstraction, or somewhere in that neighborhood, the essence of the algorithms and data-structures is generalized just enough, only to the degree needed to leverage the code for its widest possible use. But in that, it (the code) deals with and references the generalness of the data.

So, for instance you implement the corporate hierarchy as a multi-branch tree. The code talks with, and deals with the abstract concept of a tree, while the real-world problem is a corporate hierarchy.

You could "name" the variables and routines after the hierarchy, but that would be misleading if you choose to use the code elsewhere, so the real naming scheme for the bulk of the code should follow the level of generalization. Tree code should talk about trees.

Once you've implemented the tree code, then you can bind an instance of that code to any other hierarchy functioning in the system. The "binding" refers to the hierarchy, but the nuts and bolts underneath still refer to the trees.
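As a rough sketch of that split (assuming Java 7 or later; the names TreeNode, CorporateHierarchy, headCount and levelsOfManagement are invented here for illustration and are not from any real system), the generic part talks only about trees while a thin binding talks about the company:

```java
import java.util.ArrayList;
import java.util.List;

// Generic tree code: every name here talks about trees, not companies.
// (Hypothetical class, for illustration only.)
final class TreeNode<T> {
    private final T value;
    private final List<TreeNode<T>> children = new ArrayList<>();

    TreeNode(T value) {
        this.value = value;
    }

    // Adds a child holding the given value and returns the new child node.
    TreeNode<T> addChild(T childValue) {
        TreeNode<T> child = new TreeNode<>(childValue);
        children.add(child);
        return child;
    }

    T value() {
        return value;
    }

    // Counts this node plus all of its descendants.
    int size() {
        int count = 1;
        for (TreeNode<T> child : children) {
            count += child.size();
        }
        return count;
    }

    // Longest path from this node down to a leaf, counted in nodes.
    int depth() {
        int deepest = 0;
        for (TreeNode<T> child : children) {
            deepest = Math.max(deepest, child.depth());
        }
        return deepest + 1;
    }
}

// The "binding": the only place that mentions the corporate hierarchy.
// (Also hypothetical.)
final class CorporateHierarchy {
    private final TreeNode<String> root;

    CorporateHierarchy(String chiefExecutive) {
        this.root = new TreeNode<>(chiefExecutive);
    }

    TreeNode<String> chiefExecutive() {
        return root;
    }

    int headCount() {
        return root.size();
    }

    int levelsOfManagement() {
        return root.depth();
    }
}
```

The same TreeNode could be bound to a directory layout or a menu structure without renaming anything inside it, which is the reason for keeping the hierarchy vocabulary out of the tree code.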

But of course, finding that higher level of generalization and abstraction is just the first half in creating an elegant solution. The second half falls back onto that first principle of making it look simple.


DIGGING INTO THE DETAILS

Readability is huge, and there is some bias for reader, culture and technology. However, I'll ignore all of that, and concentrate on a few simple examples.

I've tried to stay away from clearly subjective issues, but some readers will no doubt be entrenched in using particular syntaxes, styles or conventions.

As a very big warning I need to say that not all things that have made it into our best practices are such, and just because "everybody" does it that way doesn't make it right.

If you've found that I've over-stepped the line, the best thing to do is make a list of all of the pros and cons, and give them some weight. Most issues are trade-offs, so you need to consider what you lose as well as what you gain. A one-sided approach is flawed.

As well, I don't claim to be perfect, and I'm not always a rigid fan of currently popular approaches like "pure" object-oriented. I'm agnostic, I am willing to try anything, but I want those solutions that are simple and really work.

A perspective that seems to grow with age and experience. I've lost my fascination with all of the little gears and dials in a clock; all I want to know now is what time it is.


LINE BY LINE

For an individual line, it is most readable when its purpose with respect to the surrounding code is immediately clear, within the context of its technology. That's fairly easy to ascertain.

An assignment, for example, in most common programming languages is equally readable no matter what the syntax. The technical mumbo-jumbo that surrounds the statement is a necessary complexity for expressing the instruction.

Simple statements, on their own are simple and readable. Combining multiple statements/operators/methods on a single line opens up the door for making it unreadable.

So for example, in a language like Java, the syntax allows one to chain method calls, but it does so in a backward fashion, reading from right to left. If the normal syntax is left to right, and suddenly that changes, it is an easy "asking for trouble" indicator:

object.bigMethod(argument, subobject.getNext().leftChild());

is not particularly readable. Chaining can be good, but not when mixed into normal syntax.

Lines of code should be simple, clean and obvious. Each and every line has but one purpose. If you've crossed multiple purposes into the same line, you've got a mess. If the arguments to your functions are syntactically complex, then you're just being lazy. Break it out into simple consistent lines.
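A minimal sketch of what breaking it out might look like. The Node and Worker classes below are hypothetical stand-ins for whatever 'object' and 'subobject' really are in the one-line example above; only the method names bigMethod, getNext and leftChild come from that line.

```java
// Hypothetical stand-in for the type of 'subobject' in the example.
final class Node {
    private final String label;
    private Node next;
    private Node leftChild;

    Node(String label) {
        this.label = label;
    }

    void setNext(Node node) {
        this.next = node;
    }

    void setLeftChild(Node node) {
        this.leftChild = node;
    }

    Node getNext() {
        return next;
    }

    Node leftChild() {
        return leftChild;
    }

    String label() {
        return label;
    }
}

// Hypothetical stand-in for the type of 'object' in the example.
final class Worker {
    String bigMethod(String argument, Node target) {
        return argument + ":" + target.label();
    }
}

final class OneStepPerLine {
    // The crowded form: navigation and the real call share one line.
    static String crowded(Worker object, String argument, Node subobject) {
        return object.bigMethod(argument, subobject.getNext().leftChild());
    }

    // The broken-out form: each line has one purpose and a named result.
    static String brokenOut(Worker object, String argument, Node subobject) {
        Node next = subobject.getNext();
        Node target = next.leftChild();
        return object.bigMethod(argument, target);
    }
}
```

Both methods return the same value. The second costs two extra lines, but each intermediate step now has a name, and a debugger or a stack trace can point at exactly one of them.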


EXCEPTIONS

At first, try-catch blocks seemed like a blessing. That was, until they started showing up everywhere. What you get very quickly is two distinctly different programs overlaid onto the same call structure. Two of anything in code is always bad.

And far worse is multiple deeply embedded try-catch blocks in the same method. It might be there to be "specific", but it threatens to be overly messy.

If you need a philosophy, catch the stuff you want to ignore at a low level (encapsulate it into its own functions please). Embed any expected normal results right into the return data. Then put one big global catch at the top to split out everything "exceptional" to log files, email (if needed) and the screen (with full or partial detail depending on the operational circumstances). Lots of little, tightly scoped low handlers and one great big one for everything else.

Only truly exceptional things should use exceptions (see "The Pragmatic Programmer" by Andy Hunt and Dave Thomas). If you expect it, then it should be part of the normal program flow.
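One possible shape for that philosophy, as a sketch only. It assumes Java 8 or later (for Optional and lambdas), and every class and method name in it is hypothetical; the list standing in for "log files, email and the screen" is just a placeholder.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// All names below are hypothetical, chosen only for illustration.
final class Settings {
    static final int DEFAULT_PORT = 8080;

    // Low-level, tightly scoped handler: a malformed port is something we
    // have decided to ignore, so it is caught here and replaced by a default.
    static int parsePortOrDefault(String text) {
        try {
            return Integer.parseInt(text.trim());
        } catch (NumberFormatException ignored) {
            return DEFAULT_PORT;
        }
    }
}

final class UserDirectory {
    private final Map<String, String> namesById = new HashMap<>();

    void add(String id, String name) {
        namesById.put(id, name);
    }

    // An unknown id is an expected outcome, so it is part of the return
    // data (an empty Optional) rather than an exception.
    Optional<String> findName(String id) {
        return Optional.ofNullable(namesById.get(id));
    }
}

final class Application {
    // Placeholder for log files, email and the screen.
    private final List<String> log = new ArrayList<>();

    // The one big handler at the top: whatever reaches it is exceptional.
    // Returns 0 on success and 1 if something unexpected escaped.
    int run(Runnable work) {
        try {
            work.run();
            return 0;
        } catch (RuntimeException unexpected) {
            log.add("FATAL: " + unexpected);
            return 1;
        }
    }

    List<String> log() {
        return log;
    }
}
```

Nothing between the small handler at the bottom and the big one at the top needs a try-catch of its own, so the normal flow of the program stays visible.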

Just say no to complex error handling madness. It will only make you cry at 4am.


LOOPING BLOCKS

It gets a little messier when we consider more complicated syntax. For example, with loop constructs in Java we can look at the following syntaxes:

for(Iterator it=object.iterator(); it.hasNext();) {
    Object value = it.next();
    ...
}

and:

Iterator it=object.iterator();
while(it.hasNext()) {
    Object value = it.next();
    ...
}

This is a good "block-level" example because it's not necessarily obvious.

From one perspective the 'for' loop is worse than the 'while' loop because it is 'clever' to abuse the syntax by not using a third argument. Clever is never good. Also, in the 'while' version each line itself is simple.

But from a broader perspective the following are also true:

1. The declaration and loop construct all fit neatly into one line (and they reference the same underlying things).

2. The scope of the 'it' variable is limited to the scope of the loop block. (in the while, it is scoped for the whole function).

3. Traditionally 'for' loops are used for 'fixed' length iterations, 'while' loops are used for variable conditional iterations. E.g. we usually know in advance the number of times the loop will execute in a for loop, but we don't know for a while loop. That makes it easier to quickly scan code and draw some assumptions about how it works.

In most cases, the "iterator" is really just syntactic sugar for:

for(int i=0;i<container.size();i++) {
    Object obj = container.get(i);
    ...
}

which is a fixed length traversal of a container. The iterator encapsulates the index variable, the size and the get call in one single object (hardly worth it, but that's another topic).

4. The new Java 5 for-each syntax allows:

for(Object obj: container) {
    ...
}

but I think they should have changed the loop name to 'foreach' like Perl to make it more obvious ;-)

5. It's more conventional to use 'for' loops for iterators in Java?

Thus for a wider range of reasons, the 'for' syntax is more readable than the 'while' one. Someone scanning the two quickly is more likely to be less confused by the 'for' loop.

The biggest reason for me is #3, that a simpler loop would do just as well. In a sense, there is a precedence between the different constructs. Never use a 'while' loop, when a 'for' loop will do. Never use a 'do-while' loop when a 'while' loop will do. Save the exotic stuff for exotic circumstances.

All functionally equivalent syntaxes should be seen as having a precedence. In that way, you should always gravitate towards the simplest syntax to get the job done.
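To make the precedence concrete, here is a small sketch with the same traversal written three ways, in the order one might reach for them. The class and method names are made up for the illustration, and it assumes Java 5 or later for the generics and the for-each form.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical helper class; the three methods compute the same sum.
final class LoopPrecedence {
    // First choice: the for-each form, with no iterator or index in sight.
    static int sumForEach(List<Integer> values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    // Second choice: an explicit iterator, still scoped to the loop itself.
    static int sumForIterator(List<Integer> values) {
        int total = 0;
        for (Iterator<Integer> it = values.iterator(); it.hasNext();) {
            total += it.next();
        }
        return total;
    }

    // Last choice: the while form, where 'it' outlives the loop.
    static int sumWhile(List<Integer> values) {
        int total = 0;
        Iterator<Integer> it = values.iterator();
        while (it.hasNext()) {
            total += it.next();
        }
        return total;
    }
}
```

All three return the same total for the same list, so picking among them is purely a readability decision, which is the point of having a precedence at all.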

In many languages there might be multiple ways to get things done, but you have to restrain yourself to consistently using the most simplified one. Just because you can, doesn't mean you should deliberately hurt yourself.


CONDITIONALS AND LOOPS

The whole point of subroutines, functions or procedures, was to allow the programmer to break off some piece of code for the purpose of making it more readable. Re-using it in multiple places in the program is a nice side-effect, but it is secondary to the ability to isolate and encapsulate some specific set of "related" instructions into a well-named unit.

At its very most basic level, one could advise newbie programmers to make a new function at the point of every conditional or loop in the program. Yep, I am saying that you might put each 'if', 'for' and 'while' statement in its own method.

That perspective is extreme of course, but if you start there, and then refactor backwards combining things together into the same functions because they are related -- in the way that one combines sentences together in the same paragraph -- then you won't be far wrong from having a clearly normalized call structure.

And if the arguments at each level are all similar, and the vocabulary of the variables in each similar function is also similar, then the overall structure of the code is nearly as clean as one can achieve.

Thus, if you have some huge routine with lots of conditionals and loops, the larger it is, the more you have to justify not having broken it down into smaller units. It happens, but usually only for complex processing like parsing, or generation.
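A toy sketch of the starting point described above, with the loop and the conditional each given its own well-named method. The domain (sensor readings with a low and high limit) and all of the names are invented for the example.

```java
import java.util.List;

// Hypothetical example: counting readings that fall outside a range.
final class ReadingReport {
    // Everything in one routine: the reader has to decode the condition.
    static int countOutOfRangeInline(List<Integer> readings, int low, int high) {
        int count = 0;
        for (int reading : readings) {
            if (reading < low || reading > high) {
                count++;
            }
        }
        return count;
    }

    // The loop lives in one method...
    static int countOutOfRange(List<Integer> readings, int low, int high) {
        int count = 0;
        for (int reading : readings) {
            if (isOutOfRange(reading, low, high)) {
                count++;
            }
        }
        return count;
    }

    // ...and the condition lives in another, where its name explains it.
    static boolean isOutOfRange(int reading, int low, int high) {
        return reading < low || reading > high;
    }
}
```

For something this small, folding the two back together would be a reasonable refactoring. The sketch only shows the direction to start from: split first, then merge what is clearly related.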


WHOLE FILES

The acronym Don't Repeat Yourself (DRY) is well intentioned (see "The Pragmatic Programmer" by Andy Hunt and Dave Thomas), but repeating oneself is not necessarily the real problem.

The reason we don't want to repeat things is because the same information is located in two different places at the same time. As the code moves over time, the odds that both instances are changed correctly decrease rapidly. Thus if you have essentially the same detail in two different spots in the same program, you are just asking for trouble, and trouble is what you will always get.

A huge way around this is to apply encapsulation to bring the details together into the same location.

So, it's not a matter of not repeating yourself, it is more like "don't separate the repeated parts across the program". Bring them all together in one place. If you have to repeat, then do it in one location, as close together as possible.

But it is not just identical data, it is any 'related' or similar data that must stay in sync.

That, and only that, is the driving nature behind packing things carefully into files. Any code that should be together when you examine it should be placed together when you build it.
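As one small illustration, consider a status code, its display label and the lookup between them. These are three pieces of related data that could easily end up in three files. The enum below is a hypothetical example (the codes and labels are invented) that keeps them on the same line, so they can be verified at a glance:

```java
// Hypothetical example: each status keeps its stored code and its display
// label side by side, so the two cannot silently drift apart.
enum OrderStatus {
    PENDING("P", "Pending"),
    SHIPPED("S", "Shipped"),
    CANCELLED("C", "Cancelled");

    private final String code;
    private final String label;

    OrderStatus(String code, String label) {
        this.code = code;
        this.label = label;
    }

    String code() {
        return code;
    }

    String label() {
        return label;
    }

    // The lookup sits next to the data it depends on.
    static OrderStatus fromCode(String code) {
        for (OrderStatus status : values()) {
            if (status.code.equals(code)) {
                return status;
            }
        }
        throw new IllegalArgumentException("Unknown status code: " + code);
    }
}
```

Adding a new status is then a one-line change in one place, rather than a hunt through a constants file, a labels file and a switch statement somewhere else.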

If the compiler or interpreter can pre-determine the correctness of the code, then it is the most technically strong answer. If you can verify it visually by looking at the line, then it is still quite strong. If you can see the whole context at once and verify it, that is also great.

If, in order to verify a line of code, you need to jump around to a dozen files, cross-reference a huge number of fields, and perform some other herculean acts, the code is weak, and likely to have bugs.

So, an element of very bad code is such that in order to make a simple and obvious change, you need to jump around a huge amount to a large number of different files.

Yes, I am saying that as a consequence of this some well dispersed object-oriented code is not particularly readable, and as such is not the most elegant solution to the problem. There is nothing inherent in object-oriented that promotes or creates readable solutions. It's just a way of chopping things up. There are some types of programming problems that are more readable when implemented as object-oriented, but there are also some problems that become horribly more obfuscated.


WRAPPING IT UP

We so fear telling each other that there is a cleaner, more readable way to implement that line of code. But as a group working together, I think it is hugely important to have these types of discussions and to really spend time thinking about this. A good development team will choose to follow the same conventions.

You can't spend too much time trying to get everyone on the same page, because the consequences involve way more effort. Wrong code doesn't get re-written fast enough, gumming up the works and causing maintenance headaches which eat into available resources.

We never move fast enough to fix our early mistakes, so we drown in our own mess.

Somehow culturally we have to get beyond our own insecurities and be able to at least talk to each other in a polite manner about the small things as well as the big.

If readability is to any significant degree widely subjective, then any differences are just opinion. That would be bad, in that it leaves the door open to massive inconsistencies and the many assorted stupid problems caused by them. Two programmers never share the same "opinion", but they might share the same "right" practice.

Not that I really want to suggest it, but in many ways, the most correct answer for readability is to presume it is universal; setting it up as a myth. The curious may seek and possibly find the truth, but hopefully along the way they would come to understand why a convenient myth is better. Sometimes we just have to choose between being right, or getting the team to work correctly.

I do know, directly from experience that it is horribly difficult to sit there with other programmers and nitpick their code for little violations of readability. But, and that can't be stressed too heavily, those small inconsistencies are far from trivial. They are in all ways little "complexity" bubbles that are brewing and forming, and getting larger all of the time. They are the start of the trouble that may potentially wreck the project. They are not trivial, even if they are small.

UPDATE: I think that it's just "readability", but someone has done an excellent job in codifying an approach towards mostly making code more readable, and called it 'Spartan Programming'. It's an excellent approach to normalizing code to make it readable and the examples are absolutely worth checking out:

http://ssdl-wiki.cs.technion.ac.il/wiki/index.php/Spartan_programming

Although, in this case, I do think that the overly-short variable names go too far and instead increase the base complexity. Acronyms, codes and mnemonics are far more 'complex' than their full un-abbreviated names. Longer is better, so long as it's not too long.