Monday, July 20, 2009

Bugs and Misfortune

I was on the hunt again. We were in the middle of testing a web application, a medium-sized one, when a couple of simple events occurred.

We had received token errors, as if we had backed up in history and then resubmitted the same form over again. The system protects against this by sending a token to the browser and storing another one in the session. On the submit request, the two tokens are compared, and the request is rejected if they are not identical.
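A minimal sketch of that kind of token check, in Python rather than the language of the actual system. The names `issue_token`, `accept_submit`, and `form_token` are hypothetical, and discarding the token after its first use is an assumption about how such a scheme rejects a resubmitted form:

```python
import hmac
import secrets


def issue_token(session: dict) -> str:
    """Create a form token, remember it in the session, and return it for the form."""
    token = secrets.token_hex(16)
    session["form_token"] = token  # hypothetical session key
    return token


def accept_submit(session: dict, submitted_token: str) -> bool:
    """Accept a submit only if its token is identical to the one in the session."""
    # Assumption: the stored token is single use, so a replayed form finds nothing.
    expected = session.pop("form_token", None)
    if expected is None:
        return False
    return hmac.compare_digest(expected, submitted_token)


session = {}
token = issue_token(session)
print(accept_submit(session, token))  # True: first submit of the form
print(accept_submit(session, token))  # False: same form submitted again
```

Because the stored token lives in the session, two browser windows sharing one session can also overwrite each other's tokens, which is why user error looked like a plausible explanation at first.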

When my companion got the first couple of token messages, I was ready to dismiss them as possible user errors. We were, after all, using several browsers each, and while it was possible in IE to get each browser in its own session, it was also possible to have two sharing the same one underneath, a situation that would easily create token errors, since they are session-based.

This changed however when I logged in from a fresh session, went straight to one of the forms, filled it out, and hit enter. Although I had a few independent instances of IE hanging about, this was a new clean session with nothing to interfere with it. Thus, the error message on screen was an immediate cause for concern.

Even so, we continued with the testing, finishing the day by getting through as many other tests as we could.

Now at this point I might still have chosen to ignore the behavioral problems. They were happening, but they were not directly tied to any specific test.

I had been at a huge unnameable company briefly, earlier in my career, and I watched how they passed over some similarly disconnected issues in their testing, just because they were difficult and not repeatable. Their code was plagued with really bad threading problems, but the testing was gratuitously ignoring them. Not a mistake I wanted to make. Not an admirable software product that I wanted to create or use.

The next day we decided to focus a little harder on finding a reproducible set of steps that was causing the problem. Strangely we succeeded on our very first attempt, but then failed to get the error for the entire rest of the day. A not too subtle reminder that this type of bug can be a very annoying type to track down.


ROOTS

In my career, I had been here before. Many times. It's that bug that we know is lurking there in the background, but forcing it to come forward is difficult and time consuming. On the other hand, ignoring it means that it will pop its ugly head up once every six months or once a year or at some other sparse interval, always coming at the most awkward moment. Always causing some big issue and then fading back into the background. One of the thousands of irregular, infrequent bugs that float around us on a daily basis.

I remember tracking down a stack overwriting problem in Mac OS 6. During an error, the C code grabbed values that were originally on the stack, but now that the stack-pointer had long-jumped up to somewhere else, they were getting corrupted by other programs sharing the same space. That cost a painful week in a debugger.

I remember tracking down a compiler code generation issue in VMS, where a C function with 56 arguments was causing the compiler to generate dysfunctional code. Depending on the compile options it would work, then it wouldn't, a seemingly random occurrence. That was worth three or four extremely late night sessions.

I remember guessing that some device on the network was responsible for truncating the HTTP requests and sending them multiple times, even though the network admins swore that no such device existed, at least until they found it later. I found it gradually through hit and miss in a frigid machine room, and it took days to thaw out again.

And I can remember hundreds of other seemingly impossible bugs, buried quietly in the code, just waiting to cause problems. The really dangerous ones were always the ones that were infrequent and irregular.

Computers are deterministic, right up to the point where they are not. At that level of complexity it can be very difficult to remember that they are simply doing exactly what they are told. Nothing more and nothing less.


MORE ISSUES

We had passed through another day of testing, with just one problem early in the morning and then nothing. The next day we conferred with another of our companions, who had been sitting out of testing because it was mostly his code changes that we were verifying.

For internal system tests, we follow the simple rule that programmers cannot system test their own work. Mostly, if someone writes the code, someone else should write the test and at least one other person should run it. In that way, the differences in perspective are more likely to show up obvious problems. Problems get found faster with multiple perspectives.

For big external releases, we use independent testers, but for the smaller internal ones we mix and match as needed depending on who did what for each release. A big formal process is OK, so long as there is a smaller, faster parallel one that can be used as well. One "rule" in software development does not fit them all.

With three of us now looking at the problem, we headed off in a couple of different directions. Our first instinct was that this problem was coming from caching somehow. That was an obvious choice, since backing up in history creates the same error message. If the page were cached, that would explain it. One of us headed off and started re-examining the underlying configuration of the container framework we were using to execute the code.

Another of us headed straight for the debugger, in the hopes that we could catch this problem in the act, while the third went back through the code, verifying that the usage of the token mechanics, while quite simple, was done correctly. It might have just been a simple coding issue.

After a bit, the caching investigation went nowhere, and a well-posed question on Stack Overflow showed that it was unlikely that the problem was coming from caching. Although the settings in the browser cache directory looked suspicious, the server was following conventional wisdom and setting the reply parameters correctly. It should have been working as coded.

On the other front, the debugger seemed to be paying some dividends. The code was acting strangely, appearing to execute multiple times, even though the actual code itself said that was impossible.

Mostly accidentally at this time, we stumbled across a way to repeatably generate the error.

In debugging, when you've finally been able to consistently reproduce the behavior, you know you have the problem beat. In this case we could consistently make it happen.

Now that we had a fixed test, we did the next logical thing. We ran a version of the test on our existing production machine, to see if this problem was in the current release. Amazingly it wasn't, but given the small number of differences between the two systems this wasn't good news.

So we did the next logical thing. We ran a version of this test against a freshly checked out development system. Up until now we had seen this error on our stand-alone test network version, and on our internal demo/test server and also in the debugger on one workstation. Oddly the freshly checked out version also failed to have the same error.

Between these various points, we had made huge progress. We had a repeatable test, and we had shown that some versions of the system had this problem and some did not. It was now really just a matter of examining all the different little bits that distinguished the different systems, and gradually working through them piece by piece until we found the culprit. A bit of work to be sure, but no longer hard work, just grunt work.
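That piece-by-piece comparison can be sketched roughly as follows, in Python. Everything here is illustrative: `find_culprits`, the setting names, and the `fails` callback (standing in for the repeatable test) are all hypothetical. Each difference is applied to a copy of the working system on its own, and the test is run again:

```python
from typing import Callable, Dict, List


def find_culprits(
    good: Dict[str, str],
    bad: Dict[str, str],
    fails: Callable[[Dict[str, str]], bool],
) -> List[str]:
    """Return the settings that, changed alone on the good system, make the test fail."""
    culprits = []
    for key in sorted(set(good) | set(bad)):
        if good.get(key) == bad.get(key):
            continue  # identical on both systems, so it cannot explain the difference
        trial = dict(good)
        if key in bad:
            trial[key] = bad[key]  # take this one setting from the failing system
        else:
            trial.pop(key, None)  # the failing system lacks this setting entirely
        if fails(trial):
            culprits.append(key)
    return culprits


# Hypothetical differences between a working and a failing installation.
good = {"runtime_library": "off", "log_level": "info", "build": "1041"}
bad = {"runtime_library": "on", "log_level": "debug", "build": "1042"}

# Stand-in for the repeatable test: here only the library setting matters.
print(find_culprits(good, bad, lambda cfg: cfg.get("runtime_library") == "on"))
# ['runtime_library']
```

Changing one thing at a time only catches a culprit that acts alone. If two differences have to combine to trigger the failure, the trials need to cover combinations as well.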

As we started to divide up the possibilities, the first most obvious piece to check was a specific "portable" run-time library that was providing a speed boost to the deployed version. Because it was easy, we undid this enhancement first, and surprisingly the problem disappeared. When we put it back the problem came back again. We un-installed and reinstalled it a couple of times to be sure. That was it. We had our culprit, the question now was why?

Once you know what the problem is, asking the right questions is a lot easier. We continued on for another hour or so, but only to really satisfy ourselves that we had the whole bug, not just part of it. On occasion, several bugs will manifest themselves together and look like one, which can be misleading. We worked through the different scenarios and convinced ourselves that the library was the genuine cause of all of our woes.


BREAKDOWN

Interestingly, it turned out that one of our development workstations was using the library, but several were not. We had no good reason for our development environments to differ from testing or production, so that was a mistake on our part.

The bug in the library was causing an incoming request in a second connection from the same host to fire off two requests in the container, a most odd type of bug. The duplicate phantom request occurred only for POSTs, and only for the first one in a session. A rare condition, so it was lucky that we caught this.
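To see why a duplicated POST turns into a token error with no user mistake involved, here is a small simulation in Python. It is only an illustration: `handle_post`, `deliver`, and the single-use `form_token` are hypothetical stand-ins, not the real container or library code:

```python
import secrets


def handle_post(session: dict, submitted_token: str) -> str:
    """Accept a form once; any repeat with the same token is rejected."""
    # Assumption: the stored token is removed as soon as it has been checked.
    if session.pop("form_token", None) == submitted_token:
        return "accepted"
    return "token error"


def deliver(session: dict, submitted_token: str, duplicate: bool) -> list:
    """Pass one browser submit to the handler, optionally twice as a faulty layer might."""
    copies = 2 if duplicate else 1
    return [handle_post(session, submitted_token) for _ in range(copies)]


session = {"form_token": secrets.token_hex(16)}
token = session["form_token"]
print(deliver(session, token, duplicate=True))
# ['accepted', 'token error']
```

The first copy uses up the token and the phantom copy is rejected, so a single, perfectly valid submit can still produce an error response.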

The production system was immune because, although the same configuration of the library was there, the code was just different enough not to cause the other session. It was a small change that we were never really able to pin down exactly, but we had a couple of good guesses.

An upgrade of the underlying library fixed the issue. We were able to ensure that it was gone, but just in case we also did another more extensive round of testing to make sure that there were no other similar bugs.

Had this bug gone out in production it would not have been fatal, but it would have been very annoying. Users would have kept calling about token problems, and we would have, at first, kept telling them that they were user errors. It's the type of bug that makes users really hate developers, yet its infrequency would keep making it look as if they were the problem, not the system.

In the end, this was a very small problem in the underlying library, hardly noted in its change-log. In our system its effect, while not fatal, was far more significant. We were very susceptible to the misbehavior of some nearly anonymous component buried deep in our system. It was a couple of layers down from our code.

Too many years of experience with these sorts of problems have made me shy away from having too many dependencies. The fewer the better.

I know some programmers think it's wonderful to grab a billion libraries and splat glue code at them to wire up a whole massive system in a couple of days of frantic cutting and pasting. That's a nice idea, but our underlying layers are way too unstable to make it anything more than a miserable reality. I'm not interested in quickly releasing unstable applications; I'd prefer to get right down there and make it work properly. If I'm going to build it, I'm going to build it properly.

For most things, the code and the algorithms aren't even that hard, and at this point are fairly well known. If you rely on someone else's implementation, chances are it will be overly complicated and prone to difficult errors. If it's your own code, at least you know what it does and you can go in and fix it properly.

Every time I see discussions and comments about not re-inventing the wheel I figure that the author will have a change of heart once they've been around long enough to get a real sense of what we are actually doing. In youth, we want to develop fast, but as we age, our priorities shift towards wanting to produce better quality, so we don't have to work as hard. I've been down too many death marches and bug fests to not know that the leading cause of failure is almost always sloppy code. Easy come, easy go, I guess.

So many young programmers think it's about getting a release out there uber-quickly, but it's actually about getting releases out there, year after year, version after version, without allowing the product to decay into some giant bundle of crap. Anyone can do it once, but it takes serious professionals to keep it going for many releases, and it takes serious self-discipline to keep it from getting worse and worse with each release.

Still, dependencies are inevitable in anything we write. The expectation for software is sophisticated enough that we no longer have the luxury of being able to build all of it ourselves. And, as a consequence, one very important skill in software development will always be debugging complex problems.

It's the first skill that most programmers should learn, and it's far more valuable than all of the others. You may be able to write really fancy algorithms, but if you can't trace down the faults, then you're unlikely to be able to ship them. Code that sort of works is not very desirable.