Monday, July 20, 2009

Bugs and Misfortune

I was on the hunt again. We were in the middle of testing a web application, a medium sized one, when a couple of simple events occurred.

We had received token errors, as if we had backed up in history and then resubmitted the same form over again. The system protects against this by sending a token to the browser and storing another one in the session. On the submit request, the two tokens are compared, and the request is rejected if they are not identical.
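
Just to make the mechanism concrete, the idea behind these tokens is roughly the following (a minimal sketch in Java; the class and names are illustrative, not our actual framework code):

    import java.util.UUID;
    import javax.servlet.http.HttpServletRequest;

    // Minimal sketch of the synchronizer-token idea (illustrative, not our
    // actual framework code).
    public class FormTokenGuard {

        private static final String TOKEN_KEY = "form.token";

        // Called while rendering the form: one copy goes into the session,
        // the other copy is embedded in a hidden field on the page.
        public static String issueToken(HttpServletRequest request) {
            String token = UUID.randomUUID().toString();
            request.getSession(true).setAttribute(TOKEN_KEY, token);
            return token; // rendered as a hidden "token" input in the form
        }

        // Called on submit: the browser's copy must match the session's copy,
        // and the token is consumed so a resubmit (or back-button replay) fails.
        public static boolean isValid(HttpServletRequest request) {
            Object expected = request.getSession(true).getAttribute(TOKEN_KEY);
            String submitted = request.getParameter("token");
            request.getSession(true).removeAttribute(TOKEN_KEY);
            return expected != null && expected.equals(submitted);
        }
    }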

When my companion got the first couple of token messages, I was ready to dismiss them as just possible user errors. We were, after all, using several browsers each, and while it was possible in IE to get each browser in its own session, it was also possible to have two sharing the same one underneath, a situation that would easily create token errors, since they are session-based.

This changed however when I logged in from a fresh session, went straight to one of the forms, filled it out, and hit enter. Although I had a few independent instances of IE hanging about, this was a new clean session with nothing to interfere with it. Thus, the error message on screen was an immediate cause for concern.

Even so, we continued with the testing, finishing the day by getting through as many other tests as we could.

Now at this point I might still have chosen to ignore the behavioral problems. They were happening, but they were not directly tied to any specific test.

I had been at a huge unnameable company briefly, earlier in my career, and I watched how they passed over some similarly disconnected issues in their testing, just because they were difficult and not repeatable. Their code was plagued with really bad threading problems, but the testing was gratuitously ignoring them. Not a mistake I wanted to make. Not an admirable software product that I wanted to create or use.

The next day we decided to focus a little harder on finding a reproducible set of steps that was causing the problem. Strangely we succeeded on our very first attempt, but then failed to get the error for the entire rest of the day. A not too subtle reminder that this type of bug can be a very annoying type to track down.


ROOTS

In my career, I had been here before. Many times. It's that bug that we know is lurking there in the background, but forcing it to come forward is difficult and time consuming. On the other hand, ignoring it means that it will pop its ugly head up once every six months, or once a year, or at some other sparse interval, always coming at the most awkward moment. Always causing some big issue and then fading back into the background. One of the thousands of irregular, infrequent bugs that float around us on a daily basis.

I remember tracking down a stack overwriting problem in Mac OS 6. During an error, the C code grabbed values that were originally on the stack, but now that the stack-pointer had long-jumped up to somewhere else, they were getting corrupted by other programs sharing the same space. That cost a painful week in a debugger.

I remember tracking down a compiler code generation issue in VMS, where a C function with 56 arguments was causing the compiler to generate dysfunctional code. Depending on the compile options it would work, then it wouldn't, a seemingly random occurrence. That was worth three or four extremely late night sessions.

I remember guessing that some device on the network was responsible for truncating the HTTP requests and sending them multiple times, even though the network admins swore that no such device existed, at least until they found it later. I found it gradually, through hit and miss, in a frigid machine room; it took days to thaw out again.

And I can remember hundreds of other seemingly impossible bugs, buried quietly in the code, just waiting to cause problems. The really dangerous ones were always the ones that were infrequent and irregular.

Computers are deterministic, right up to the point where they are not. At that level of complexity it can be very difficult to remember that they are simply doing exactly what they are told. Nothing more and nothing less.


MORE ISSUES

We had passed through another day of testing, with just one problem early in the morning and then nothing. The next day we conferred with another of our companions, who had been sitting out of testing because it was mostly his code changes that we were verifying.

For internal system tests, we follow the simple rule that programmers cannot system test their own work. Mostly, if someone writes the code, someone else should write the test and at least one other person should run it. In that way, the differences in perspective make obvious problems far more likely to show up. Problems get found faster with multiple perspectives.

For big external releases, we use independent testers, but for the smaller internal ones we mix and match as needed, depending on who did what for each release. A big formal process is OK, so long as there is a smaller, faster parallel one that can be used as well. One "rule" in software development does not fit them all.

With three of us now looking at the problem, we headed off in a couple of different directions. Our first instinct was that this problem was coming from caching somehow. That was an obvious choice, since backing up in history creates the same error message. If the page were cached, that would explain it. One of us headed off and started re-examining the underlying configuration of the container framework we were using to execute the code.

Another of us headed straight for the debugger, in the hopes that we could catch this problem in the act, while the third went back through the code, verifying that the usage of the token mechanics, while quite simple, was done correctly. It might have just been a simple coding issue.

After a bit, the caching angle went nowhere, and a well-posed question on Stack Overflow showed that it was unlikely that the problem was coming from caching. Although the settings in the browser cache directory looked suspicious, the server was following conventional wisdom and setting the reply headers correctly. It should have been working as coded.
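
For reference, the conventional wisdom for the reply is to mark the page as non-cacheable. A sketch of the standard headers (not our exact configuration) looks something like this:

    import javax.servlet.http.HttpServletResponse;

    // The standard "do not cache this page" reply headers (a sketch of the
    // conventional settings, not our exact configuration).
    public class NoCacheHeaders {
        public static void apply(HttpServletResponse response) {
            response.setHeader("Cache-Control", "no-cache, no-store, must-revalidate");
            response.setHeader("Pragma", "no-cache"); // for HTTP/1.0 proxies
            response.setDateHeader("Expires", 0);     // already expired
        }
    }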

On the other front, the debugger seemed to be paying some dividends. The code was acting strangely, appearing to execute multiple times, even though the actual code itself said that was impossible.

Mostly by accident, at this point, we stumbled across a way to repeatably generate the error.

In debugging, when you've finally been able to consistently reproduce the behavior, you know you have the problem beat. In this case we could consistently make it happen.

Now that we had a fixed test, we did the next logical thing. We ran a version of the test on our existing production machine, to see if this problem was in the current release. Amazingly it wasn't, but given the small number of differences between the two systems this wasn't good news.

So we did the next logical thing. We ran a version of this test against a freshly checked-out development system. Up until now we had seen this error on our stand-alone test network version, on our internal demo/test server, and also in the debugger on one workstation. Oddly, the freshly checked-out version did not have the error either.

In combining these various different data points, we had made huge progress. We had a repeatable test, and we had shown that some versions of the system had this problem and some did not. It was now really just a matter of examining all the different little bits that distinguished the different systems, and gradually working through them piece by piece until we found the culprit. A bit of work to be sure, but no longer hard work, just grunt work.

As we started to divide up the possibilities, the first and most obvious piece to check was a specific "portable" run-time library that was providing a speed boost to the deployed version. Because it was easy, we undid this enhancement first, and surprisingly the problem disappeared. When we put it back, the problem came back again. We uninstalled and reinstalled it a couple of times to be sure. That was it. We had our culprit; the question now was why?

Once you know what the problem is, asking the right questions is a lot easier. We continued on for another hour or so, but only to really satisfy ourselves that we had the whole bug, not just part of it. On occasion, several bugs will manifest themselves and look to be one, but that can be misleading. We worked through the different scenarios and convinced ourselves that the library was the genuine cause of all of our woes.


BREAKDOWN

Interestingly, it turned out that one of our development workstations was using the library, but several were not. We had no good reason for our development environments to differ from testing or production, so that was a mistake on our part.

The bug in the library was causing an incoming request on a second connection from the same host to issue two requests in the container, a most odd type of bug. The second, duplicate phantom request occurred only from POSTs, and only from the first one in a session. A rare condition, so it was lucky that we caught this.

The production system was immune because, although the same configuration of the library was there, the code was just slightly different enough not to trigger the duplicate request. A small difference that we were never really able to pin down exactly, but we had a couple of good guesses.

An upgrade of the underlying library fixed the issue. We were able to ensure that it was gone, but just in case we also did another, more extensive round of testing to make sure that there were no other similar bugs.

Had this bug gone out into production it would not have been fatal, but it would have been very annoying. Users would have kept calling about token problems, and we would have, at first, kept telling them that they were user errors. It's the type of bug that makes users really hate developers, yet its infrequency would keep making it look as if they were the problem, not the system.

In the end, this was a very small problem in the underlying library, hardly noted in its change-log. In our system its effect, while not fatal, was far more significant. We were very susceptible to the misbehavior of some nearly anonymous component buried deep in our system, a couple of layers down from our own code.

Too many years of experience with these sorts of problems have made me shy away from having too many dependencies. The fewer the better.

I know some programmers think it's wonderful to grab a billion libraries and splat glue code at them to wire up a whole massive system in a couple of days of frantic cutting and pasting. That's a nice idea, but our underlying layers are way too unstable to make it anything more than a miserable reality. I'm not interested in quickly releasing unstable applications, I'd prefer to get right down there and make it work properly. If I'm going to build it, I'm going to build it properly.

For most things, the code and the algorithms aren't even that hard, and at this point are fairly well known. If you rely on someone else's implementation, chances are it will be overly complicated and prone to difficult errors. If it's your own code, at least you know what it does and you can go in and fix it properly.

Every time I see discussions and comments about not re-inventing the wheel, I figure that the author will have a change of heart once they've been around long enough to get a real sense of what we are actually doing. In youth, we want to develop fast, but as we age, our priorities shift towards wanting to produce better quality, so we don't have to work as hard. I've been down too many death marches and bug fests to not know that the leading cause of failure is almost always sloppy code. Easy come, easy go, I guess.

So many young programmers think it's about getting a release out there uber-quickly, but it's actually about getting releases out there, year after year, version after version, without allowing the product to decay into some giant massive bundle of crap. Anyone can do it once, but it takes serious professionals to keep it going for many releases, and it takes serious self-discipline to keep it from getting worse and worse with each release.

Still, dependencies are inevitable in anything we write. The expectation for software is sophisticated enough that we no longer have the luxury of being able to build all of it ourselves. And, as a consequence, one very important skill in software development will always be debugging complex problems.

It's the first skill that most programmers should learn, and it's the one that is far more valuable than all of the others. You may be able to write really fancy algorithms, but if you can't trace down the faults, then you're unlikely to be able to ship them. Code that sort of works is not very desirable.

Sunday, July 5, 2009

In Our Dependence

Consider a line of code. Any line really, it doesn't make a difference. From a far off, hazy perspective they are all similar. Just another instruction for the computer to process, in a long sequence of instructions.

At some point, probably many times during its operational life, the line of code will get executed, performing some simple operation like changing a variable, checking a condition or calling a function.

Most complex languages do allow for significantly more complicated sequences of events to be packed within a single line of code, but even then, each event is really something simple underneath. A simple instruction of some type.

We can consider these complex aggregations to be syntactic conveniences, and not something more advanced. In that sense, a line of code might actually be a clump of lines all compacted together, but for this post we'll ignore that.

We're only interested in looking at a single, simple instruction, and how it relates to the rest of the things around it. For now, we really don't care what the line does, or even how it really interacts with the other lines in the system. We're just concerned with its dependencies.


SETS OF INSTRUCTIONS

For our line of code to work properly a great number of things have to happen first.

Most obviously, the preceding line of code must have executed properly. And before that, all of the other preceding lines must have executed properly too. Each instruction, in a long sequence of instructions, has a dependency on the earlier instructions having worked correctly. An algorithm that is only half successful at the end is clearly invalid. Completely useless.

In this context, one can view all functionality in a software system as being just fixed sequences of instructions, with some minimal looping. We could lay out all of the lines of code that preceded ours, into a giant flattened list. A complete and final list needed to implement a specific instance of the code.

At some point, deep in the program, this current driving piece of functionality was triggered by some entry point. Something started it running.

All programs, whether they are command line utilities, batch jobs or GUIs, are really just launching pads for similar, related functionality. Our line of code belongs to one or more of these execution sets. The primary difference is whether the entry point was triggered by a human, a set of arguments, or some other mechanism.

Most computer processing relates back to the actions of individuals somewhere, although wiring it up to run at certain times of the day is popular too.


IN CONTEXT

As lines of code execute, they share significant dependencies on the various different "contexts" around them that contain data.

The functionality starts, for instance, with some constrained set of arguments that specifically control its behavior. This might be a set of states in a user application, or it might just be some parameters in a command line. Possibly some user input coming from a dialog or a form. It's a localized context for this specific execution of this specific functionality. The functionality context usually maintains the specifics of the execution.

The overall user context may have been a long-running session of some type, gradually building up more and more specific information as it grows. Sessions can share data across many different functionality contexts. This can build up very complex, long-running dependencies. In this way, the session context essentially stores the short-term history.

The application itself started up with some preset series of configurations, generally in a file, but they could also come from some persistent data store (database) or possibly command line arguments. Some applications can even be initialized on the fly from a third-party location. Usually this information is consistent between different runs of the application, but not always. An application context is often more operationally oriented data: information about how to run in, or interact with, the environment.

The most significant data is usually the persistent stuff for the actual domain. That long-term data has been building up in the system, often over years. It is the core data that the user is trying to work with. The real stuff.

All functionality must implement some type of domain specific behavior. Software doesn't execute, unless it is trying to provide a specific answer to a specific problem. Generalized code may get reused underneath, but at the highest level, the code that gets executed must be bound to some real, and useful problem. Code runs because people use it for something.

Thus, the underlying persistent domain data is clearly the most influential data in the system. It really defines the success or failure of the computation.

The other contexts generally change the nature and the shape of the instruction sets, whereas the domain data is directly affected by the instructions themselves. In a real sense, the various contexts combine to control the way the algorithms create or manipulate the core domain data.

Of course, in practice, it is rarely that clean or refined.


A LITTLE BIT OF KNOWLEDGE

Sometimes the data affected by a specific sequence of code can actually be an amalgamation of several bits collected together from multiple contexts.

For instance, the application may have a root path in a configuration file that is then appended with some relative path information from other sources like the database, before finally being completed with some user input for a file name.

If the line of code is to call a function to open a file, for example, the final absolute path for the file may have been assembled from a very long and complex process starting in a huge number of different places in the code base.
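
As a sketch of how that knowledge ends up scattered across contexts (all of the names here are hypothetical):

    import java.io.File;
    import java.util.Properties;

    // Hypothetical sketch: the root directory comes from the application
    // configuration, the relative path from the database, and the file name
    // from user input. If any one piece is wrong, the open fails.
    public class ReportLocator {

        private final Properties config; // e.g. reports.root=/var/data/app

        public ReportLocator(Properties config) {
            this.config = config;
        }

        public File locate(String relativePathFromDatabase, String userFileName) {
            String root = config.getProperty("reports.root");    // configuration file
            File dir = new File(root, relativePathFromDatabase); // persistent store
            return new File(dir, userFileName);                  // user input
        }
    }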

In a very real sense, all of those locations, code and data, that are necessary combine together to be fragments of the same underlying data requirement. They are all directly tied to the success of the code. I.e. the knowledge required to find the final file is scattered throughout the code, the configuration files and the database.

It is "partially repeated" in each of the different locations. If any one part fails, the entire handling fails.

Of course, we know that it is a very significant design goal to not have the same pieces repeated multiple times in multiple locations. We've been trying for decades to create programming paradigms to help with this issue.

In some cases, we do this for code, but it is also equally applicable for data, particularly operational context data.

Inevitably something changes, thus increasing the probability that any repeated knowledge (code or data) will fall out of synchronization. Bugs -- the most significant ones -- generally come from different parts of any program failing to coordinate their behaviors. These types of bugs are often the most serious, and can be difficult and expensive to track down.

Static data, configuration files, caches and data stores, as well as the active contexts, are all locations of "state" related information that needs to be coordinated. Modern "Object Oriented" programming languages compound the problem by also allowing the code itself to have its own run-time object state as well. The object state has little to do with the actual sequence of instructions running; it is just a by-product of the language decomposition, yet it is another easy place for errors to accumulate.

State, of any type, is always a bad thing. It builds up and changes in the system over time. In that way it becomes very difficult to test the code, because of the wide variety of possible internal states. Complicated logic tied to state is hard to maintain. Entirely state-free code, that is only ever dependent on a single direct input stream, is far easier to test.

If the code "remembers" nothing, then a suitable range of inputs can validate its limited behavior in an operational environment. For each thing it remembers, a multiplier happens on the testing effort.


DOWNSTREAM

Clearly, there are a lot of things happening that can affect the execution of our single line of code.

A lot has to go right in order for it to work as expected. Still, that is nothing compared to what is actually dependent on it working. The results of our line running, good or bad, affect a huge number of things. Changes in what the line does can have an even wider effect.

One of the most obvious dependencies is any direct downstream code. Worst off are the lines farthest away, possibly in another module, possibly in some other system. Failures or changes could have a significant effect on whether or not they continue to work. The code running later depends on the current results. And that's not just code local to this line; it's also a huge amount of code in this system, and any other system, that will need access to the final results at some point.

But that also includes all of the future versions of all of the dependent code as well.

In a philosophical sense, because the line ultimately affects the long-term persistent data, there is an infinite set of dependent lines of code for as long as that data stays in play. Data lasts nearly forever, and it depends on all those bits of code, in the past and upcoming in the future, to do the right thing. So long as the data is not decommissioned, it binds them all together. Something wrong early on propagates its way through all of them, until it is fixed.

There are all sorts of other lines of code, both adjacent, and far away physically and in time, that share significant dependencies on this line of code. This includes building, testing, packaging, installation and operations.

There could be problems with the build scripts that create the system, most are generic, but some are occasionally driven by code itself.

Testing can be adversely affected as well. A change in a specific line of code may force a huge number of tests to be altered. Depending on the depth and binding, the dependencies between the code and the tests may be just as strong as any other line in the system.

Supporting code, sometimes called scaffolding, may not be active in an operational environment, but it is still code by any other name.

Some unit-testing philosophies bind the code to tests at almost a one-to-one ratio, which means that any impact on one side is easily doubled (if not worse) across the whole system.

Packaging and installation could be affected as well. These are often forgotten about, but a reasonable installation also includes both the ability to install a new version and to upgrade. Small changes in code can cause massive upgrade and installation issues, particularly if there are new dependencies on previously uncollected data. Upgrade or "migration" paths are also usually the weakest parts of most systems, precisely because they aren't used often, are badly written and are hard to test.

Most multi-user software also gets some level of custom integration once it actually goes into an operational environment. There could, for example, be countless lines of code in various scripting languages that are dependent on this line. Scripts to cleanup, monitor or facilitate better integration between different systems. Since the software vendors rarely cooperate with each other, and are usually only selling thin slices of the solution, operational binding is a huge part of any multi-user software's life span.


SUPPORTING ROLES AND DOCUMENTS

Of course, a change to the line of code may also significantly change all of the related documentation as well.

A strong dependency comes from the comments in the source code itself. They describe the code, but there is nothing to enforce that description. Thus comments often quickly fall out of sync with the surrounding behavior. The comments may come in front, next to, behind or batched in a block, but their accuracy depends on how people have kept them up-to-date with the code. A misleading comment may ultimately waste time. A lot of time.

Changes may cause failures in multiple language translations. There can be a mass of duplicated text and documentation in multiple-language systems. Every piece of text has to exist multiple times, in every language. Keeping that in sync is often a huge anchor on the code. It is organizationally hard, and often involves expensive re-translations.

If our line of code fails to run as expected, then huge amounts of other documentation and supplementary information could in fact turn out to be wrong as well. These surrounding dependencies rely on things working the way they should. They rely on things not changing often.

With most modern commercial software, the actual code is tiny compared to all of these secondary documentation dependencies. They exist to facilitate the running of the code in different environments. This includes all of the design documents, marketing brochures and help files, but also any manuals and tutorials that get created. Any non-trivial system has (or should have) a significant library of related documents. A simple change to the code could spawn countless hours of related updating work.

Issues may also be propagated out to bug fix systems and support domains. There may be procedures that must change. FAQs that need updating. There are always the first line support tools, but often there are also a huge number of deeper ones. Some explicit, some just ad-hoc.

Various operations personnel may be affected as well. There is usually an army of people responsible in most organizations for running and upgrading systems. A single line of code could possibly change their jobs, create huge upgrade projects or just render them obsolete. For some commercial systems, the upgrade procedures themselves are complex enough to have spawned related consulting industries.

In some cases, new people may need to be hired to monitor or augment the code. Software has always created a large number of specific employment positions beyond just the basic development and operational ones. The organization may have to change as well, to support using the code. The implementations may need project managers, and other people to keep the systems running smoothly. Domain-specific experts may be required to keep it all working properly.

Code, often thought of as only a virtual thing -- ignoring the obvious point that it does "physically" alter magnetic bits in memory and on disks -- reaches out from its machine existence and entangles itself all over the real world, in all sorts of non-obvious ways.


NOT AFFECTED

While we've been looking at what is affected by our line of code, it is also worth looking at what is not.

There will always be lines of code in a system, particularly low-level ones that have some widespread effect across the entire scope of the system. However, a well-written system will actually minimize the amount of code that has these types of dependencies.

System architecture, encapsulation and good functional decomposition go a long way toward helping to contain the overall dependencies of any single line of code.

If our line was properly encapsulated into a module, then most of the other lines in the system should be isolated from it.

If the context going into our system was well laid out by the architecture, then only that specific context will affect the running.

If the different contexts and data are logically separated, then it is easy and obvious to know which ones are affected by our line of code.

We should be able to see the transformations in the domain data, caused by the overall algorithm or functionality to which this line belongs. There should be no mysteries; any changes should be clear and obvious.

The code itself should sit at a point in the files and directories of the overall system so as to indicate its place in the design and architecture. Everything about the project should help lead to some significantly useful piece of information. Some piece of knowledge. It should all be as obvious as possible.

As a general rule of thumb, any given "experienced" programmer with the code should be able to reasonably and reliably predict the impact of any changes. They should be able to identify all dependent code and data, fairly quickly.

If the code fails that type of impact test, it is absolutely spaghetti code, by any definition. All lines of code should exist and be in a location where their contribution to the whole system is obvious. Arbitrary or seemingly random lines of code point to serious self-discipline or structuring problems. Nothing about code should ever be random or arbitrary. It always matters.

And certainly there is no need or purpose in Computer Science for having code that does not behave in a predictable, rational and explainable manner. Even with complex heuristics or models like neural nets, there is always some high level of explainable expected behavior, and the impact of the low-level code is constrained to very specific parts of the system. We cannot (and should not) use systems, if we do not understand what they will really do.


SUMMARY

A single line of code is a remarkable thing. It sits there in a big complex world surrounded by a sea of dependencies. It somehow has to manage to work correctly each and every time it runs, although most of the state of the world around it is gray and fuzzy.

We have so many problems with our systems simply because we do not properly value the significance of all of our code. It's all too common to see reasonable systems fall down because of really badly written supporting scripts, for example. It's common to see configuration files composed of arbitrary junk that has been building up for many versions. Many installation scripts barely work, and are often highly fragile.

Some gifted programmers might put a huge effort into getting the core of the system to be well-built, but that doesn't mean much if the other 80% is a mess.

We have all sorts of strategies and rules of thumb about not repeating code over and over again, but oddly few of these apply to the various bits of data that are also necessary. If we only encapsulate one small aspect of the code, the other aspects can still easily become indecipherable. Nicely commented spaghetti code is still spaghetti code, for instance. Quality is defined by the weakest points, not the best ones or even the average.

The idea, then, is for all "related" elements (code or data) to be as spatially compacted as possible. The closer things are together, the easier it is to keep them in sync with each other.

The tighter we can truly encapsulate "knowledge" into the system, the more independent the pieces really become. When they can stand alone, they make it possible for us to reuse them as widely as possible.

If changes in the code must be mirrored by ones in an XML configuration file as well, or vice versa, then the overall structure is repetitive and poorly encapsulated. The "knowledge" is split between two distinct locations. An architecture that declaratively redefines most elements in an external file, is clearly not a good design.
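
A small, made-up example of that kind of split: the same field has to be named in both the Java class and an external XML mapping file, and renaming it in only one place is a latent bug:

    // Made-up example of split knowledge. The mapping file repeats the field:
    //
    //   <!-- user-mapping.xml -->
    //   <field name="emailAddress" column="EMAIL_ADDR" required="true"/>
    //
    // Renaming the field in the class but not in the XML (or vice versa)
    // silently breaks the binding.
    public class User {
        private String emailAddress; // must stay in sync with the XML above

        public String getEmailAddress() { return emailAddress; }

        public void setEmailAddress(String emailAddress) {
            this.emailAddress = emailAddress;
        }
    }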

Well-structured code is a far better thing to work with. We should be able to look at a line of code and easily determine its effect on the overall system. Since we don't need chaotic systems (except for study and simulations), there is no reason we should build them.

Testing the overall quality of the code is not that complex of a process. For the most part, the lines in the system should be in obvious places with obvious behaviors. If the system needs some extremely complicated type of documentation in order for a reasonable programmer to make fixes, then it is likely that the structuring is extremely poor.

Well-written programs make the solution look simple and obvious. Badly written ones require documentation. Knowing the impact of a change in a well-written system is easy. Guessing the impact in spaghetti code is hard. At any given line of code in the program, you should both be able to understand what it does, and correctly predict its effect on everything else. It's that simple.