Sunday, November 30, 2014

Error Handling

When I first started programming my focus would be on what I'll call the 'main execution path'. That is, if my task was to add a screen with some new features to the system, I would concentrate on getting the data from the persistence layer to the screen with as few distractions as possible. Basically, to simplify the task, I'd ignore all of the external things that might go wrong while the code was running, effectively assuming that the system was living in a "perfect world".

It didn't take long to realize that this was a problem. Internally computers might be deterministic, but more and more they have to interact with the messy chaos of the surrounding world. If the code doesn't cope with that chaos, as I quickly realized, then the operational and support issues start to overwhelm your ability to continue to extend the system. That is, if the code is breaking for every little unexpected problem and frequently requires manual intervention, then as the system grows the interventions grow even faster. At some point the whole thing becomes completely operationally unstable. The main execution paths might be working perfectly, but a robust system also needs decent error handling.

The reason I think most programmers choose to code for what I call the perfect world scenario is that it is a whole lot easier than trying to write code that intelligently copes with reality. Really good code that can withstand anything that happens is extremely complex and generally has a large number of interwoven execution paths. It takes time to think through all of the corner-cases, and then even more time to organize and structure it properly to be readable.

The pinnacle of engineering for code is to achieve 'zero maintenance'. That is, you install the system and never have to worry about it again. It just runs. If for any reason there are questions about its state, then you can safely reboot the hardware without concern. There is nothing else that needs to be done, except possibly to upgrade to a better version someday.

Zero maintenance software is exceptionally hard to build, but it can really help a big project achieve its goals. Instead of hacking away at an ever-growing sea of ugly problems caused by accidental inter-dependencies, you get the surety of knowing that the foundations are fully encapsulated and reliable. This allows developers to concentrate on moving forward, extending or adding features. Once a problem is solved, we want it to stay solved until we choose to revisit it later. If you have that, then you have the freedom to plan out a reasonable work strategy that won't keep getting pushed aside because of the next crisis.

Mostly, under modern time constraints it's nearly impossible to get real zero maintenance code; the expectations of the stakeholders no longer allow us to spend enough time perfecting all of the little details. It's unfortunate, but unlikely to change anytime soon. These days I tend to shoot for very low maintenance code, always hoping for a chance to do better.

For any piece of code, it is important to distinguish between what's internal and what's external. The demarcation is simple: the only thing that's truly internal in any system is the code that your team has written and is currently maintaining. That's it. If you can't change it, then it is external. This is really important because the first rule of zero maintenance is to never 'trust' anything external. If it's outside of your ability to control, then you have to accept that it will go wrong someday. I first encountered this philosophy when I was young and thought it was overly paranoid. However, after decades of watching things fail, and change, and get disorganized, I've got at least one good story about every possible external dependency (code, configuration, libraries, data, users, administration, etc.) going bad. Nothing, it turns out, is really safe from tampering or breaking.

Even internally, you still have to watch out for your own team. Programmers come and go, so it's easy for a new one to misunderstand what was written and weaken it to the point of continual and hard to diagnose failures. Good documentation and code reviews help, but a strong architecture, good working habits and well-written software are the only real defenses.

The first big problem programmers encounter is shared resources, like a persistent database. If you have to open up a connection to something, then there are days when it will fail. The naive approach is to just let errors get thrown right to the top, stopping the code. That might be okay if the database was down for days because of a major failure, but most often it's really just unavailable for a short time, so the system should wait nicely until it is up again and then continue with whatever it was doing. That means that for any and all database requests, there is some code to catch the problem and keep retrying until the database is available again. The first time this occurs, some warning should be logged, and if the problem persists, every so often another message is logged to say that the database still isn't available, but care should be taken not to swamp the log file with too many messages. Depending on what's calling the database access code, this handling should happen forever or it should time out. If the code is a long running backend process, it should just keep trying. If it's a GUI request, it should eventually return an 'unavailable' message to the user, or gracefully degrade into lesser service (like stale data).

Some errors will require reinitialization, while others may just need to wait before retrying. Any error coming from the connection should be explicitly handled, so that the programmers who follow can verify that the handling exists. Given the amount of database code that is in most systems, writing this type of proper error handling every time would be way too much work, so this is one place where reuse is absolutely paramount. The retry/reconnect semantics should be encapsulated away from the calling code, with the exception of perhaps one simple flag indicating whether or not this code should 'quit' after some predefined system timeout (and all interactive code should use the same timing).
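
To make the idea concrete, here is a minimal sketch of that kind of encapsulation in Python. The DatabaseUnavailable class just stands in for whatever transient error your particular driver raises; the timings and names are assumptions, not a definitive implementation.

    import logging
    import time

    log = logging.getLogger("db")

    class DatabaseUnavailable(Exception):
        """Stands in for whatever transient error the real driver raises."""

    def with_retry(operation, timeout=None, retry_every=5, log_every=300):
        # Keep retrying until the operation succeeds. Log one warning when the
        # trouble starts, then another every few minutes so the log isn't
        # swamped. Interactive callers pass a timeout so they can eventually
        # report 'unavailable'; backend processes pass None and wait forever.
        started = time.time()
        last_logged = None
        while True:
            try:
                return operation()
            except DatabaseUnavailable:
                now = time.time()
                if last_logged is None or now - last_logged >= log_every:
                    log.warning("database unavailable, still retrying")
                    last_logged = now
                if timeout is not None and now - started >= timeout:
                    raise
                time.sleep(retry_every)

A GUI call site would pass the shared system timeout and turn the exception into an 'unavailable' message or stale data; a long running backend job would simply never time out.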

It's hard enough to structure this type of code for simple queries or updates, but it's a whole other level to make this work across distributed transactions on separate databases. Still, what we want if one of the databases is currently unavailable is to wait for it to come up again and then continue on with the processing. We want the software to remember where it was, and what still needs to be done. We do have to watch for tying up resources, though. Say you have a transaction across two independent databases and the second one is down. The whole transaction often needs to be rolled back to avoid unnecessarily holding a 'read' lock on the first database. What works for most circumstances is to put parts of the system into a passive 'waiting' state until the missing resources return.
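
As a rough sketch of that passive waiting state (reusing the imports and the hypothetical DatabaseUnavailable error from the earlier example, and assuming standard commit/rollback connection objects):

    def update_both(conn_a, conn_b, apply_a, apply_b, retry_every=30):
        # Apply one logical unit of work to two independent databases. If the
        # second one is down, roll back the first so its locks aren't held,
        # then sit in a passive waiting state and try the whole unit again.
        while True:
            try:
                apply_a(conn_a)
                try:
                    apply_b(conn_b)
                except DatabaseUnavailable:
                    conn_a.rollback()
                    raise
                conn_a.commit()
                conn_b.commit()
                return
            except DatabaseUnavailable:
                time.sleep(retry_every)

This isn't a true distributed transaction, but for most circumstances it remembers where it was and quietly finishes the work once the missing database returns.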

This works for handling long running backend processes that interact with databases, but there are also shared resources for simpler programs like web applications. These can be files, pipes or even cooperating processes that can fluctuate depending on the chaos of their implementations. Our desire is to not let their problems propagate into our system.

Operationally, because of tools like 'quota', it is even possible for a server to run out of disk space. Again the best approach is for the system to wait patiently until the space has been resized, then continue on with whatever it was doing. You rarely see this in code; usually running out of disk space causes a major crash, but given that the disk is just another shared resource it should be treated the same way. RAM however is often treated differently, because under most circumstances the OS will 'thrash' long before memory is exceeded. That is, it will be swapping in and out an excessive number of virtual pages to the point that that's all it is really doing anymore. For long running processes that need potentially large amounts of memory, care needs to be taken at all levels in the code to only allocate what is absolutely necessary and to reuse it appropriately. Creating leaks is really easy and often expensive to correct later. It's best to deal with these types of constraints up front in the design, rather than later in panic recoding.
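
A sketch of treating the disk like any other temporarily missing shared resource, assuming a POSIX-style environment where a full disk or exceeded quota shows up as the ENOSPC error:

    import errno
    import logging
    import time

    log = logging.getLogger("files")

    def append_line(path, line, retry_every=60):
        # Wait patiently for space to come back instead of crashing the
        # whole process over a temporary operational problem.
        warned = False
        while True:
            try:
                with open(path, "a") as f:
                    f.write(line + "\n")
                return
            except OSError as e:
                if e.errno != errno.ENOSPC:
                    raise                    # anything else really is our bug
                if not warned:
                    log.warning("disk full, waiting to write %s", path)
                    warned = True
                time.sleep(retry_every)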

In a simpler sense, the zero maintenance ideas even apply to command line utilities. Two properties that are commonly desired are 'interruptible' and 'idempotent'. The first means that any command can be interrupted at any time while it's running, and even if the timing is awkward that won't prevent it from working when run again. That's extraordinarily hard to accomplish if the command line does work like updating an existing file, but it is a very useful property. If you started something accidentally or it's running slowly, you should just be able to kill it without making any of the problems worse. Idempotent is similar, in that if the same command is run multiple times then there should be no ill effects. The first time it might insert data in the database from a file. Each subsequent time, with the same file, it will either do nothing or perform updates instead of inserts. I like the second approach better, because it allows for the file to be modified and then the system to be synchronized to those changes.
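
One common way to get the interruptible property when a command rewrites an existing file is to build the new version beside it and swap it in atomically; a minimal sketch, assuming a simple text file:

    import os
    import tempfile

    def rewrite_file(path, transform):
        # Write the new contents to a temp file in the same directory, then
        # atomically replace the original. Killing the command at any moment
        # leaves the file either untouched or fully updated, never half done.
        with open(path, "r") as f:
            new_text = transform(f.read())
        fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        try:
            with os.fdopen(fd, "w") as f:
                f.write(new_text)
            os.replace(tmp_path, path)
        except BaseException:
            os.unlink(tmp_path)
            raise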

For whatever reason the originators of relational databases decided to split writes into two very separate functions: insert and update. Often programmers will use one or the other for their task, but generally zero maintenance code always needs both. That is, it checks first to see if the data set is already in the system, and then adds or modifies it accordingly. Getting that right can actually be very messy. In some cases the 'master' version of the data is in the database, so any edits are applied directly. Sometimes the actual master version of the data exists in an external system. If the edits are coming from that system they obviously take precedence, but if they aren't then it's a problem. The best approach is for any specific data entity to assign one and only one system as the actual master of the data; every other system just owns a read-only copy. For most systems that is fine, but there are several cases where it becomes murky. That should be sorted out long before coding. For really bad problems, to preserve integrity and help catch issues, the data needs to be consistently labeled with a 'source' and a 'time'. In some systems there is also an audit requirement, so the actual user making the insert, update or delete needs to be captured as well. Adding in these types of features may seem like overkill, but the moment something goes wrong, and it always will, being able to correctly diagnose the problem because you haven't just blown over the original data will be a lifesaver, and of course the first and very necessary step towards actually fixing the problem permanently. Sometimes you can make the code smart, other times you just have to provide the tools for the users to work out what has happened.
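
A sketch of that check-then-add-or-modify logic, written for a DB-API connection in SQLite syntax with a hypothetical client table; it also carries the 'source' and 'time' labels discussed above, and running it twice with the same data has no ill effects:

    def upsert_client(conn, client):
        # Insert the entity if it's new, otherwise update the existing row,
        # always recording where the data came from and when it changed.
        row = conn.execute("SELECT 1 FROM client WHERE client_code = ?",
                           (client["client_code"],)).fetchone()
        if row is None:
            conn.execute(
                "INSERT INTO client (client_code, name, source, updated_at) "
                "VALUES (?, ?, ?, datetime('now'))",
                (client["client_code"], client["name"], client["source"]))
        else:
            conn.execute(
                "UPDATE client SET name = ?, source = ?, "
                "updated_at = datetime('now') WHERE client_code = ?",
                (client["name"], client["source"], client["client_code"]))
        conn.commit()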

Persistent databases actually cause all sorts of data clashing problems that are frequently ignored. One of the simplest is the stale read that turns into a lost update. Someone starts editing data, but they walk away from the computer for a while. In the meantime, someone else comes in and makes some edits. Nice systems will alert the original user that the underlying data has changed, and really nice ones will present all three versions to the user to allow them to manually merge the changes. Mean systems will just silently clobber the second person's data. Some systems will just irritate the user by timing out their editing session. Occasionally you see a system where the first user locks out the second, but then those systems also need tools and special handling to break the locks when it turns out the first user has gone on vacation for a couple of weeks. Mostly though, you see mean systems, where the programmers haven't even noticed that there might be a problem.
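
One common defence is optimistic locking, sketched below with a hypothetical 'version' column, so that a save only succeeds if nobody else has changed the data since this user read it:

    class StaleEdit(Exception):
        """The row changed underneath the user after they loaded it."""

    def save_client_name(conn, client_id, new_name, version_seen):
        # The update applies only if the row is still at the version the
        # user originally read; otherwise someone else got there first.
        cur = conn.execute(
            "UPDATE client SET name = ?, version = version + 1 "
            "WHERE client_id = ? AND version = ?",
            (new_name, client_id, version_seen))
        conn.commit()
        if cur.rowcount == 0:
            # A nice system reloads the current row here and lets the user
            # merge, rather than silently clobbering either edit.
            raise StaleEdit(client_id)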

The key aspect for zero maintenance is not trusting anything external. That means that any data coming from any source, user interface, database, or external feed from any other system is not to be trusted. Thus it needs very specific code to validate it, and often to validate its cross-dependencies as well. Like the database code, this type of checking would require too much code if it were rewritten for every screen or every data feed. Thus, this is another place where reuse is mandatory. If you have one library of checks for any possible external data entity, then all you need to do is call a validator if you know that the data comes from outside of the code base. Again, dealing with bad data from a feed is different than dealing with bad data from an editing screen, but it should be handled similarly. Given a data entity, which in OOP would be an object, the validator should return a list of errors. If there are none, then it is fine. If there are some, then on an edit screen each bad field can be flagged, or in a batch process the errors are combined into a log message. Depending on the data, it might be correct to ignore bad feed data, or the system might fall back to default data, or even sometimes the data might be redirected to an edit queue, so that someone can fix it when they get a chance and then retry. Each approach has its own merits, based on the source of the data and the system requirements.
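
A minimal sketch of one shared validator per entity, with made-up field names; the same function can serve an edit screen, a batch feed, or an edit queue:

    def validate_client(data):
        # Return a list of problems; an empty list means the data is usable.
        errors = []
        if not data.get("client_code"):
            errors.append("client_code is missing")
        if not data.get("name"):
            errors.append("name is missing")
        if data.get("credit_limit", 0) < 0:
            errors.append("credit_limit cannot be negative")
        return errors

An edit screen flags each field named in the list; a batch loader joins the list into one log message or pushes the record onto a repair queue.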

In some systems, if the incoming data is unavailable, the best choice is to show an error screen. But that is fairly ugly behavior and sometimes the users would rather see stale data. In that case the system should fall back to older data, but it needs to clearly indicate this to the users. For a realtime monitoring system, if the users are watching a table of numbers, the nicest code would colour the table based on the age of the data. If nothing were available for perhaps minutes or hours, the colour would gradually go to red, and the users would always be aware of the quality of what they are seeing. They wouldn't need to call up the operations dept. and ask if things are up-to-date.
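
Grading the age of the data into a display colour only takes a few lines; the thresholds here are arbitrary:

    def age_colour(age_seconds):
        # Fresh data is green; the colour degrades toward red as the data
        # goes stale, so users always know the quality of what they see.
        if age_seconds < 60:
            return "green"
        if age_seconds < 15 * 60:
            return "yellow"
        if age_seconds < 60 * 60:
            return "orange"
        return "red"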

Zero maintenance also implies that any sort of editing requirement by the users is fully handled by the interface. For every entity, if the data is in the system it is either read-only or there is a way to add, edit and delete it. There should never be any need to go outside of the interface for any data admin tasks. If the system contains historic data, there are tools available to check and correct it if there are errors. Data can always be removed, but even if it is no longer visible, it still maintains a presence in the log or audit handling. The general idea is to empower the users with the full ability to track and correct everything that they've collected. If that is set up well, the programmers can stay far away from the issues of data maintenance, and if the underlying persistence storage is a relational database with a normalized schema, the programmers can avoid 'reporting hell' as well by pushing that work onto some generic reporting tool. The up-front time saved from having the development team avoid data maintenance and reporting can be better spent on perfecting the code base or fully adding in new features.

Getting really old data into a new system makes it more useful, but it is often a massive amount of work. Mostly the problems come from weak modeling, poor quality or fuzzy data merges. What often happens is that programmers skip the temporal aspects of the data, or its original source information. Sometimes they just model only the subset that concerns them at the present moment, and then try to duct tape their way into extending the model. By now you'd think that there would be readily established tools to model and move data around, but except for some very pricey products (which no one ever wants to pay for) most of the ETL code is just hacked individually, for each specific type of data. If the requirements are for a wide ranging collection of historic data, then it makes a lot of sense to design and implement one big general way of back-filling any type of data within the system. It's not trivial to write, but it really is a lot of fun, and it teaches programmers to handle coding issues like abstraction and polymorphism. The idea is to eliminate absolutely every special case possible, while providing an easily configurable interface like a DSL to apply transformations and defaults. Once that type of facility is built, the original programmers can move onto something else, while some other group does the actual migration. It's best to get the right people focused on working on the right problems at the right time.
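
The configurable interface doesn't have to be an elaborate DSL to start with; even a table of per-field mappings, each with a clean-up function and a default, removes most of the per-feed special cases. A sketch with invented source columns:

    # our field -> (column in the old feed, clean-up function, default)
    CLIENT_MAPPING = {
        "client_code": ("CUST_ID",   lambda v: v.strip().upper(), None),
        "name":        ("CUST_NAME", lambda v: v.strip(),         "UNKNOWN"),
        "region":      ("REGION",    lambda v: v.strip(),         "UNSPECIFIED"),
    }

    def transform_row(raw_row, mapping):
        # One generic transformer driven by configuration, rather than a
        # fresh hand-coded loader for every historic feed.
        entity = {}
        for field, (source_col, clean, default) in mapping.items():
            value = raw_row.get(source_col)
            entity[field] = clean(value) if value else default
        return entity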

Another common mistake is confusion about keys. They are really an internal concept. That is, a system creates and manages its own keys. Sometimes you get data from other systems with their keys included. Those are external keys; they can't be relied upon and the data needs to be carefully rekeyed internally. Most developers really like creating new numeric keys for all major entities; this not only helps speed up access, but it abstracts away the underlying fields in the key and allows for reasonable editing. Sometimes programmers shy away from creating numeric keys in an effort to save time or thinking, but that's almost always the type of mistake that is heavily regretted later. Either because it is inconsistent with the rest of the schema, or a situation arises where the underlying fields really do need to be edited but also remain linked with other entities. Keeping the inter-relationships between data entities is crucial for most systems, since real world data is rarely flat. Once the data is in the system, its modeling should be strong enough that it doesn't need to be messed with later, which is always a horrifically expensive exercise. Just tossing data in, in any form, and then expecting to fix it later is a recipe for disaster. Dumping unrelated data into arbitrary fields is absolute madness.
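
In SQLite terms, a sketch of the surrogate key approach with hypothetical tables: the internal numeric key carries the relationships, so the business fields underneath it can still be edited without breaking any links.

    import sqlite3

    conn = sqlite3.connect("example.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS client (
            client_id   INTEGER PRIMARY KEY,     -- internal surrogate key
            client_code TEXT NOT NULL UNIQUE,    -- business key, still editable
            name        TEXT NOT NULL,
            source      TEXT NOT NULL            -- system that owns the master copy
        );
        CREATE TABLE IF NOT EXISTS account (
            account_id  INTEGER PRIMARY KEY,
            client_id   INTEGER NOT NULL REFERENCES client(client_id),
            name        TEXT NOT NULL
        );
    """)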

Given that zero maintenance systems can be conveniently restarted if there are questions about their internal state, it is also necessary that they are configured to start automatically and that their startup times are minimized. In the middle of a bad problem you can't afford to wait while the code warms up gradually. Most often the delays are caused by faulty attempts at performance optimizations like caching. When implemented correctly these types of optimizations shouldn't require prolonged start up times; they should kick in immediately and gradually get more efficient as time progresses. It's easier to just wait until everything has been slowly loaded into memory, but again that's just another dangerous shortcut to avoid spending the effort to build the code properly.
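
A sketch of a cache that is usable immediately and only gets faster with time, rather than one that blocks startup while it pre-loads everything:

    class LazyCache:
        # Entries are loaded the first time they are asked for; nothing is
        # pre-loaded, so the process can serve requests the moment it starts.
        def __init__(self, load_fn):
            self._load = load_fn
            self._items = {}

        def get(self, key):
            if key not in self._items:
                self._items[key] = self._load(key)
            return self._items[key]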

Eventually all systems need to be installed or upgraded. A zero maintenance system has an installer to make the process trivial. But it also often has the attribute that any two versions of the system can co-exist on the same machine without interfering with each other. Again, really getting that to work is complex, particularly when it comes to handling shared resources. For instance, if you have one version installed on a box with an older schema and lots of data, installation of the newer version should be able to create a new schema, then copy and upgrade the old data into it. In fact, at any time, any new version should be able to grab data from an older one. In some cases, the databases themselves are just too large to allow this, but a well-written system can find a way to copy over all of the small stuff, only causing minimal disruption. The key reason this is necessary is that new releases sometimes contain subtle, lurking bugs that go unnoticed for days. If one presents itself, the process to roll back should be easy, safe and quick, but it should also carry any new data back to the old schema. That way most of the work isn't lost and the choice to roll back isn't devastating.

One important thing to be able to detect is faulty installations. Once code goes into a production environment, it is out of the hands of the programmers. Strange things happen. It is best for each piece to have a clear and consistent version number and for all of the pieces to cross check this and warn if there is a problem. Overly strict code can be annoying, particularly if it is wrong about the problem, so I've always felt that in 'debug' mode the code should stop at the first error it encounters, but in production it should try to muddle through while issuing warnings to the log. This avoids the problem where a minor error with a change gets escalated to a major problem that stops everything.
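
A sketch of that cross check at startup, with a hypothetical version string: stop dead in debug mode, warn and muddle through in production.

    import logging

    log = logging.getLogger("startup")

    EXPECTED_SCHEMA_VERSION = "3.2"    # hypothetical; whatever this build expects

    def check_install(schema_version, debug=False):
        if schema_version == EXPECTED_SCHEMA_VERSION:
            return
        msg = ("schema version %s does not match the expected version %s"
               % (schema_version, EXPECTED_SCHEMA_VERSION))
        if debug:
            raise RuntimeError(msg)    # fail fast on a bad installation
        log.warning(msg)               # in production, warn but keep going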

Bugs happen, so zero maintenance accepts that while trying very hard to not let any of them deceive the users. That is, if the data is questionable there needs to be a way to still present it, but flag it. Then it becomes the user's problem to decide whether or not it is serious. It doesn't matter if it really is a data problem or if it is because of a coding bug, the facility to move forward and alert the users should be designed into the system.

Overall, the general idea with zero maintenance is to solve the problem at a deep enough level that the programmers don't have to get involved in operations or any of the user issues with collecting and managing a large set of data. If it's all done well, the system becomes obsequious. That is, it fades into the background and the problems encountered by the users are those that belong to their job, not the technology that they are using to accomplish it. Operations is rarely bothered, and there is just one simple approach to handling bad days. If you think about most software out there, it works fine for a bit but becomes extraordinarily ugly if you try to deviate outside of its main execution path. This causes all sorts of pain and frustration, and generally is one of my core reasons for insisting that the current state of software is 'awful'. Computers with well-written software can be powerful tools that help us manage the chaos and complexity of the world around us. What we don't need is an extra level of annoyance brought on by programmers ignoring that reality. A system that only works on good days is barely usable; one that we can trust because it is zero maintenance is what we really need. And as programmers, just splatting some data to the screen in a perfect world is only an interesting problem to solve for the first few years; after that it's time to move on to the real challenges.