Wednesday, April 27, 2011

Resilience

When you are developing software, one problem you don’t want is an error that occurs while the processing carries on until things get completely scrambled. This makes it really hard to track down the ‘where’, ‘when’ and ‘why’ of the initial error.

With that in mind, we often build systems to stop immediately on the first suspicious result. Doing this makes it easy to find the problem and forces you to fix it before it goes into production. It helps in building better quality at a faster pace.

However useful, this is the worst thing for a real production system. If some small piece is broken, you don’t want the whole system grinding to a halt in the middle of the night. That type of extreme reaction will only create a slew of secondary problems, which will result in a lot more time being consumed than necessary. Since time is a scarce resource in IT, we have to use it wisely.

Systems should be resilient. They should continue working where possible. Actions, data, etc. should be atomic, so that ordering or partial completion doesn’t matter. Failures should be monitored, but routed around, to ensure that small problems don’t escalate into major issues. Small problems are the norm for most computing environments.
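As one small illustration of what ‘atomic’ can mean in practice, here is a minimal sketch in Python of replacing a file so that a crash part-way through never leaves a half-written copy behind. The function name write_atomically is made up for this example, and it assumes the temporary file and the target sit on the same filesystem.

```python
import os
import tempfile


def write_atomically(path, text):
    """Replace the file at `path` with `text`, or leave the old file untouched.

    Hypothetical helper for illustration only.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the swap
        # os.replace swaps the new file in as a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temporary file; the original is still intact.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A reader of the file sees either the old contents or the new ones, never a mixture, so a failure in the middle of the write is a small problem rather than a corrupted file.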

A lot of thinking has gone into this in our industry. We have concepts like transactions and the ACID properties, which exist to make sure we subdivide the processing in such a way that failures are isolated. We’ve been building fault-tolerant systems for decades. Together, this body of knowledge allows organizations like Google to massively distribute their systems in a way that significantly reduces the impact of hardware or software problems. These are well-understood, well-documented issues.
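To make the idea of a transaction concrete, here is a rough sketch using Python’s built-in sqlite3 module. The accounts table, its columns and the transfer function are assumptions invented for the example. The point is only that a failure half-way through leaves the data exactly as it was.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` between two rows of a hypothetical `accounts` table."""
    # Used as a context manager, the connection commits if the block
    # finishes and rolls back if an exception escapes it.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        row = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if row is None or row[0] < 0:
            # Raising here undoes the debit above.
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )
```

If the check fails after the first update, the rollback restores the original balance, so the failure stays isolated to the one transfer that caused it.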

Clearly, since we need both types of error handling, the choice of how to handle errors depends on the context. A system in development should stop immediately, while one in production should make every effort to keep going. A well-written system also ‘silos’ its functionality so that problems in one area of the system don’t overflow into others.
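The two modes can be sketched in a few lines of Python. This is only an illustration: the run_steps function, its fail_fast flag and the logger name are all invented for the example, and a real system would decide the mode from its own configuration.

```python
import logging

logger = logging.getLogger("resilience_sketch")  # hypothetical logger name


def run_steps(steps, fail_fast):
    """Run each (name, callable) pair and return the names of failed steps."""
    failed = []
    for name, step in steps:
        try:
            step()
        except Exception:
            if fail_fast:
                # Development mode: stop at the first error so it is easy to trace.
                raise
            # Production mode: record the failure for monitoring and keep going.
            logger.exception("step %s failed; continuing", name)
            failed.append(name)
    return failed
```

Each step is wrapped on its own, which is a crude form of the ‘silo’ idea: in production a broken step gets logged and skipped, while the remaining steps still run.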

All systems should have these two basic modes of operation. This obviously makes coding a little more difficult and adds to the amount of work, but if done well, it can reduce the operational and support efforts. Software always has bugs, so it is important to accept this and build with that in mind. Also, operations personnel are users of the system. A well-written system not only makes it easy for its users to accomplish their tasks, but also makes it easy to deploy and keep running. Both goals are important.