Wednesday, December 26, 2007

A Silly Little Storm

This post looks into a small and common problem, but one that frequently causes larger issues. Even the tiniest of problems, starting as a small rock in a stream, disrupts an increasingly larger segment of water behind it. This type of turbulence exceeds the sum of its parts as more rocks combine to cause larger and larger ripples. With enough rocks you get rapids; the whole being much larger than its pieces.

Sometimes when I am commenting on other writers' blog entries I just want to end my comment with a little bit of shameless self-promotion. Usually that is my name and then a link back to my blog to make it easy for readers to stop by and visit me:

Paul.
http://theprogrammersparadox.blogspot.com

These are my favorite last two lines in most comments.

Of course, to avoid mistyping the URL I generally go back to my blog's root page, highlight the URL, copy it and then paste that into the comment box a couple of times. Pasting decreases the likelihood of typos; usually a good thing.

Someone pointed out that in Blogger I should wrap the URL in an anchor tag so that it is easier for any readers to click on the link. No problem. That seems easy enough. The simplest way to accomplish this is to type in the starting tag '<a href=', paste in the URL, close it with '>'; then paste in the URL again -- to make it the visible text as well -- and finally add in the closing '</a>'. How hard is that?

In Blogger, when I go to the root page for my blog, I get an extra forward slash '/' character because it is at the 'top' level, and not pointing to any specific entry. If I selected a blog entry then the URL ends with the '.html' extension because it is directly referencing a file. But normally if I want the top level URL for my blog, I'll go to the main page and copy the URL with its extra forward slash at the end.

Funny enough, in Blogger there is this stupid little 'issue' where an anchor tag in the form of "<a href=XX />link</a>" that has an extra '/' at the end of the open tag works perfectly in the preview, but is slightly modified when actually posted. In the preview it appears as if the forward slash is ignored -- everything looks fine, which is no doubt a browser-dependent behavior -- and Firefox interprets the tag as expected.

But once you post the comment:

<a href=http://theprogrammersparadox.blogspot.com/>
http://theprogrammersparadox.blogspot.com/</a>

it gets turned into:

<A HREF="http://theprogrammersparadox.blogspot.com" REL="nofollow"/>

Visually the hyperlink text disappears and any of the following text becomes the link instead. As well, the newlines get turned into breaks and there are paragraph markers inserted at the beginning and end of each paragraph.

I am guessing, but when you actually post the comment it is sent to the database for storage. On the way in or out, but usually in, the comment text is modified by some well-meaning, but obviously destructive code. That 'function' is looking to perform some modifications on the comment text; from above we see it added a REL attribute to the anchor tag; it also converted the tag to uppercase. Because it sees the anchor tag as unary, it seems to strip the orphaned closing '</a>' tag. It probably does lots of other stuff as well. The parsing for that function saw the trailing forward slash and assumed that it was part of the HTML tag syntax, unlike the browser, which sees it as part of the preceding directory path syntax. The browser's parsing behavior is described as 'greedy' because it tries for the longest possible token, which in this case is the full directory path with the appended forward slash. The internal function on the other hand is probably ignoring the directory path syntax and is just looking directly for '<', '>' and any '/'s to modify them. Since it isn't fully parsing the HTML it doesn't have to understand it as well as the browser does.
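To make that guess concrete, here is a minimal Python sketch of the two readings. It is not Blogger's code or Firefox's; the function names and regular expressions are hypothetical stand-ins that only mimic the behavior I observed. One reader treats an unquoted attribute value greedily, the other scans for '/>' and rewrites the tag as unary.

```python
import re

# The comment fragment exactly as I typed it (unquoted href, trailing slash).
FRAGMENT = (
    "<a href=http://theprogrammersparadox.blogspot.com/>\n"
    "http://theprogrammersparadox.blogspot.com/</a>"
)

def greedy_href(fragment):
    """Browser-like reading: an unquoted value runs until whitespace or '>'.

    The trailing '/' is therefore kept as part of the URL.
    """
    match = re.match(r"<a\s+href=([^\s>]+)>", fragment)
    return match.group(1) if match else None

def shallow_rewrite(fragment):
    """Hypothetical shallow scanner: the first '/>' ends a unary tag.

    Assumed behavior only: it quotes the href, adds REL, uppercases the
    tag, and drops the now-orphaned closing '</a>'.
    """
    match = re.match(r"<a\s+href=([^\s>]*?)/>", fragment)
    if match is None:
        return fragment
    tag = '<A HREF="%s" REL="nofollow"/>' % match.group(1)
    rest = fragment[match.end():].replace("</a>", "")
    return tag + rest

print(greedy_href(FRAGMENT))
print(shallow_rewrite(FRAGMENT))
```

Same characters in, two different structures out: the first reader keeps the slash in the URL, the second steals it for the tag syntax.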

So, wham, bam, I copy the URL into the text a couple of times, check it in the preview and kapow, I post it. "Oops, what happened there?" Suddenly the text and the link look very different from the preview. If you combine both of the above issues, that simple little extra '/' picked up in the URL causes quite an annoying problem, but not one that is detectable in preview. Everything looks great until you actually post. A very embarrassing error.

You may have noticed that I didn't wrap the href argument in double quotes. That certainly would have fixed the problem, but why toss in an extra set of double quotes when the preview has already shown that not having them will work properly? It is just extra stuff that doesn't seem necessary. Dropping the trailing '/' works too. But the 'problem' isn't what is actually typed; the problem is the preview. It is incorrectly verifying that stuff is working, when clearly it is not.
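For what it is worth, the quoted form is easy to generate rather than type by hand. A small sketch, assuming a hypothetical helper called anchor (my name, not anything Blogger provides) and using only Python's standard html.escape:

```python
from html import escape

def anchor(url, text=None):
    """Build an anchor tag with a double-quoted href.

    With the quotes in place, a trailing '/' in the URL can no longer
    sit directly against the closing '>' of the tag.
    """
    label = url if text is None else text
    return '<a href="%s">%s</a>' % (escape(url, quote=True), escape(label))

print(anchor("http://theprogrammersparadox.blogspot.com/"))
```

The closing quote sits between the slash and the '>', so even a shallow scanner hunting for '/>' has nothing to trip over.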

The cause here, beyond my own sloppiness, starts with two different points in history where at least two different programmers assigned two different meanings to the same underlying forward slash character: one as the separator between directory names in a path string, and the other as a modifier to the tag syntax to indicate a unary tag or an ending tag depending on placement. These two distinct meanings 'overload' the character so that when combined in this specific circumstance the true meaning is 'nearly' ambiguous; one block of code sees it one way, while the other sees it differently. When we 'paste' together different 'idioms' that have been crafted over time, we often get impedance mismatches. Little blots of dangerous complexity.

Added to that, two different pieces of code are parsing the same HTML fragment very differently. However close Blogger's HTML parsing routine is to Firefox's, it is different enough to cause problems. I could try testing this in Internet Explorer, but it doesn't really matter to me who is right and who is not following the standard. Firefox is being greedy, Blogger is not. More likely, to greedily parse the directory path Blogger's code would have to go to a huge amount of extra work, most of which would be an unnecessary level of complexity beyond what they actually needed to manipulate the comments. They won't ever parse the HTML fragment correctly, because they do not need to go to that level of understanding.

What is important is that the same basic 'data' is being interpreted in multiple ways by multiple different programs and that while it is not ambiguous, it might as well be, since any two coders cannot and will not parse it identically. Partial parsing will always produce ambiguities. Overloading will too. The meaning of our data is so often relative to how we handle it in our code. We fail to separate the two, implicitly burying knowledge in some piece of code.

Topping it all off, some programmer implemented the preview function without fully understanding what he or she was doing. If there are 'translations' done in other parts of the system then it is exactly these 'translations' that need to exist in the preview function. Without them preview is useless. Incomplete. A 'mostly' preview function is a sloppy dangerous featurette that mainly misleads people and diminishes the tool. In the end had this been implemented correctly, none of the other issues would matter or be of any consequence. I would have seen the broken link in the preview and fixed it. Although earlier problems set up the circumstance, the 'buck' stops at the final and most obvious place.
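The fix I have in mind is structural rather than clever. As a rough sketch, with hypothetical names throughout and a deliberately trivial stand-in for whatever rewriting really happens on the way to storage, preview and post simply call the same function:

```python
def store_transform(comment):
    """Stand-in for the real rewriting step (assumed, not Blogger's).

    Here it only turns newlines into breaks, one of the changes I saw.
    """
    return comment.replace("\n", "<br />")

def preview(comment):
    """Show exactly what would be stored, by reusing the same transform."""
    return store_transform(comment)

def post(comment, storage):
    """Store the transformed comment and return what was stored."""
    stored = store_transform(comment)
    storage.append(stored)
    return stored

storage = []
draft = "Paul.\nhttp://theprogrammersparadox.blogspot.com"
print(preview(draft) == post(draft, storage))
```

With one shared transform there is no second interpretation for the preview to disagree with; whatever damage the rewriting does shows up before posting, not after.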

We might quickly dismiss this type of issue as trivial -- perhaps I should just learn to quote my arguments correctly -- if it weren't for the fact that this type of thing is standard all over our software. It is one hell of a turbulent stream of never-ending little issues that together keep this stuff from being trustworthy. I have the greatest mental tool known to mankind on my desktop, but I can't trust it to cut-and-paste properly? We've so many old legacy problems acting as rocks in the stream that our users are constantly drowning in rapids of our own making. And so often, all we ever do is tell them that they shouldn't have added that extra forward slash; "everybody knows you can't have extra forward slashes"; and we wonder sometimes why so many people hate computers?

A related theory is that you can judge the overall complexity of a 'machine' by the size of the effects caused by changes to its small details. In a simple system, small changes have only a small impact, but as the overall complexity grows chaos becomes more prominent so that small issues have progressively larger and larger impacts. The inherent complexity of the problem leads us to a base change effect of a specific size, but with every increasing dose of artificial complexity those effects become magnified. In a big system if you can draw some conclusions about the size of the initial problems vs. the size of their effects, over time you might get some gauge on how quickly the artificial complexity is accelerating within the system. This is important if you are interested in how many more changes you could make to the existing system without applying some type of major cleanup before it becomes likely that the changes will become impossible to make. Once it has passed some threshold, no significant change can be made to an overly complex system without it causing severe and damaging side-effects.

UPDATE: I'm not even going to try to put words to the amount of frustration I just experienced trying to move this piece between Google Docs and Blogger. The formatting, anchor tags, are not happy; consider yourself lucky if you can even read this...